Message ID | a9aa42f6bc2739ab46ce015f749e15177f8730d6.1729028033.git.boris@bur.io (mailing list archive) |
---|---|
State | New, archived |
Series | btrfs: do not clear read-only when adding sprout device |
On 2024/10/16 08:08, Boris Burkov wrote:
> If you follow the seed/sprout wiki, it suggests the following workflow:
>
> btrfstune -S 1 seed_dev
> mount seed_dev mnt
> btrfs device add sprout_dev
> mount -o remount,rw mnt
>
> The first mount mounts the FS readonly, which results in not setting
> BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
> somewhat surprisingly clears the readonly bit on the sb (though the
> mount is still practically readonly, from the users perspective...).
> Finally, the remount checks the readonly bit on the sb against the flag
> and sees no change, so it does not run the code intended to run on
> ro->rw transitions, leaving BTRFS_FS_OPEN unset.
>
> As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
> does no work. This results in leaking deleted snapshots until we run out
> of space.
>
> I propose fixing it at the first departure from what feels reasonable:
> when we clear the readonly bit on the sb during device add.
>
> A new fstest I have written reproduces the bug and confirms the fix.
>
> Signed-off-by: Boris Burkov <boris@bur.io>

The fix looks good to me: small, and it keeps the super block ro flag
consistent.

IIRC the old behavior of sprout is that adding a device will immediately
mark the fs RW, which is a big surprise, changing the RO/RW status.

So the extra RW remount requirement looks very reasonable to me.

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu

> ---
> Note that this is a resend of an old unmerged fix:
> https://lore.kernel.org/linux-btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u
> Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN
> were also explored but not merged around that time:
> https://lore.kernel.org/linux-btrfs/cover.1654216941.git.anand.jain@oracle.com/
>
> I don't have a strong preference, but I would really like to see this
> trivial bug fixed. For what it is worth, we have been carrying this
> patch internally at Meta since I first sent it with no incident.
> ---
> fs/btrfs/volumes.c | 4 ----
> 1 file changed, 4 deletions(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index dc9f54849f39..84e861dcb350 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2841,8 +2841,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>  	set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE);
>
>  	if (seeding_dev) {
> -		btrfs_clear_sb_rdonly(sb);
> -
>  		/* GFP_KERNEL allocation must not be under device_list_mutex */
>  		seed_devices = btrfs_init_sprout(fs_info);
>  		if (IS_ERR(seed_devices)) {
> @@ -2985,8 +2983,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>  	mutex_unlock(&fs_info->chunk_mutex);
>  	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
>  error_trans:
> -	if (seeding_dev)
> -		btrfs_set_sb_rdonly(sb);
>  	if (trans)
>  		btrfs_end_transaction(trans);
>  error_free_zone:
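For readers less familiar with the mechanism described in the quoted commit message above, the following is a minimal, self-contained sketch of the flag comparison it refers to. It is not the actual btrfs remount code; all names here (ex_remount, ex_super_block, EX_SB_RDONLY, fs_open) are hypothetical stand-ins, used only to illustrate why clearing the rdonly bit during device add makes the later "mount -o remount,rw" look like a no-op:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins, not kernel definitions. */
#define EX_SB_RDONLY (1UL << 0)

struct ex_super_block {
	unsigned long s_flags;
	bool fs_open;              /* stand-in for the BTRFS_FS_OPEN bit */
};

static void ex_rw_transition(struct ex_super_block *sb)
{
	sb->s_flags &= ~EX_SB_RDONLY;
	sb->fs_open = true;        /* cleaner work is only enabled here */
}

/* Sketch of the remount decision: no flag delta means the work is skipped. */
static void ex_remount(struct ex_super_block *sb, unsigned long new_flags)
{
	bool want_ro = new_flags & EX_SB_RDONLY;
	bool have_ro = sb->s_flags & EX_SB_RDONLY;

	if (want_ro == have_ro)
		return;                 /* "no change": ro->rw work skipped */
	if (have_ro && !want_ro)
		ex_rw_transition(sb);   /* intended ro->rw path */
}

int main(void)
{
	struct ex_super_block sb = { .s_flags = EX_SB_RDONLY, .fs_open = false };

	/* Device add clearing the rdonly bit early (the behavior being removed). */
	sb.s_flags &= ~EX_SB_RDONLY;

	/* "mount -o remount,rw": flags already match, so fs_open stays false. */
	ex_remount(&sb, 0);
	printf("fs_open after remount,rw: %d\n", sb.fs_open);  /* prints 0 */
	return 0;
}
```

With the patch under discussion, the device add leaves the rdonly bit set, so the subsequent remount sees an actual ro->rw change and runs the transition work.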
在 2024/10/16 08:30, Qu Wenruo 写道: > > > 在 2024/10/16 08:08, Boris Burkov 写道: >> If you follow the seed/sprout wiki, it suggests the following workflow: >> >> btrfstune -S 1 seed_dev >> mount seed_dev mnt >> btrfs device add sprout_dev >> mount -o remount,rw mnt >> >> The first mount mounts the FS readonly, which results in not setting >> BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add >> somewhat surprisingly clears the readonly bit on the sb (though the >> mount is still practically readonly, from the users perspective...). >> Finally, the remount checks the readonly bit on the sb against the flag >> and sees no change, so it does not run the code intended to run on >> ro->rw transitions, leaving BTRFS_FS_OPEN unset. >> >> As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and >> does no work. This results in leaking deleted snapshots until we run out >> of space. >> >> I propose fixing it at the first departure from what feels reasonable: >> when we clear the readonly bit on the sb during device add. >> >> A new fstest I have written reproduces the bug and confirms the fix. >> >> Signed-off-by: Boris Burkov <boris@bur.io> > > The fix looks good to me, small and keeps the super block ro flag > consistent. > > IIRC the old behavior of sprout is, adding device will immediately mark > the fs RW, which is a big surprise changing the RO/RW status. > > So the extra Rw remount requirement looks very reasonable to me. Forgot to mention, although it's a trivial change in the behavior, if we are determined to go this path, the man page of the "SEEDING DEVICE" chapter also need to be updated. Thanks, Qu > > Reviewed-by: Qu Wenruo <wqu@suse.com> > > Thanks, > Qu > >> --- >> Note that this is a resend of an old unmerged fix: >> https://lore.kernel.org/linux- >> btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u >> Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN >> were also explored but not merged around that time: >> https://lore.kernel.org/linux-btrfs/ >> cover.1654216941.git.anand.jain@oracle.com/ >> >> I don't have a strong preference, but I would really like to see this >> trivial bug fixed. For what it is worth, we have been carrying this >> patch internally at Meta since I first sent it with no incident. >> --- >> fs/btrfs/volumes.c | 4 ---- >> 1 file changed, 4 deletions(-) >> >> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >> index dc9f54849f39..84e861dcb350 100644 >> --- a/fs/btrfs/volumes.c >> +++ b/fs/btrfs/volumes.c >> @@ -2841,8 +2841,6 @@ int btrfs_init_new_device(struct btrfs_fs_info >> *fs_info, const char *device_path >> set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE); >> >> if (seeding_dev) { >> - btrfs_clear_sb_rdonly(sb); >> - >> /* GFP_KERNEL allocation must not be under device_list_mutex */ >> seed_devices = btrfs_init_sprout(fs_info); >> if (IS_ERR(seed_devices)) { >> @@ -2985,8 +2983,6 @@ int btrfs_init_new_device(struct btrfs_fs_info >> *fs_info, const char *device_path >> mutex_unlock(&fs_info->chunk_mutex); >> mutex_unlock(&fs_info->fs_devices->device_list_mutex); >> error_trans: >> - if (seeding_dev) >> - btrfs_set_sb_rdonly(sb); >> if (trans) >> btrfs_end_transaction(trans); >> error_free_zone: > >
On Wed, Oct 16, 2024 at 08:42:14AM +1030, Qu Wenruo wrote: > > > 在 2024/10/16 08:30, Qu Wenruo 写道: > > > > > > 在 2024/10/16 08:08, Boris Burkov 写道: > > > If you follow the seed/sprout wiki, it suggests the following workflow: > > > > > > btrfstune -S 1 seed_dev > > > mount seed_dev mnt > > > btrfs device add sprout_dev > > > mount -o remount,rw mnt > > > > > > The first mount mounts the FS readonly, which results in not setting > > > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add > > > somewhat surprisingly clears the readonly bit on the sb (though the > > > mount is still practically readonly, from the users perspective...). > > > Finally, the remount checks the readonly bit on the sb against the flag > > > and sees no change, so it does not run the code intended to run on > > > ro->rw transitions, leaving BTRFS_FS_OPEN unset. > > > > > > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and > > > does no work. This results in leaking deleted snapshots until we run out > > > of space. > > > > > > I propose fixing it at the first departure from what feels reasonable: > > > when we clear the readonly bit on the sb during device add. > > > > > > A new fstest I have written reproduces the bug and confirms the fix. > > > > > > Signed-off-by: Boris Burkov <boris@bur.io> > > > > The fix looks good to me, small and keeps the super block ro flag > > consistent. > > > > IIRC the old behavior of sprout is, adding device will immediately mark > > the fs RW, which is a big surprise changing the RO/RW status. > > > > So the extra Rw remount requirement looks very reasonable to me. > > Forgot to mention, although it's a trivial change in the behavior, if we > are determined to go this path, the man page of the "SEEDING DEVICE" > chapter also need to be updated. I just checked the man page and everything there looks correct, at least to me. It quite clearly states that after the 'device add' the fs is ready to be remounted read-write (via cycle mount or remount). Please let me know if there is some specific update you want me to make that I missed, though! BTW, this patch does change the rdonly flag behavior, so I will update the test to look at that, as you suggested. > > Thanks, > Qu > > > > Reviewed-by: Qu Wenruo <wqu@suse.com> > > > > Thanks, > > Qu > > > > > --- > > > Note that this is a resend of an old unmerged fix: > > > https://lore.kernel.org/linux- > > > btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u > > > Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN > > > were also explored but not merged around that time: > > > https://lore.kernel.org/linux-btrfs/ > > > cover.1654216941.git.anand.jain@oracle.com/ > > > > > > I don't have a strong preference, but I would really like to see this > > > trivial bug fixed. For what it is worth, we have been carrying this > > > patch internally at Meta since I first sent it with no incident. 
> > > --- > > > fs/btrfs/volumes.c | 4 ---- > > > 1 file changed, 4 deletions(-) > > > > > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > > > index dc9f54849f39..84e861dcb350 100644 > > > --- a/fs/btrfs/volumes.c > > > +++ b/fs/btrfs/volumes.c > > > @@ -2841,8 +2841,6 @@ int btrfs_init_new_device(struct btrfs_fs_info > > > *fs_info, const char *device_path > > > set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE); > > > > > > if (seeding_dev) { > > > - btrfs_clear_sb_rdonly(sb); > > > - > > > /* GFP_KERNEL allocation must not be under device_list_mutex */ > > > seed_devices = btrfs_init_sprout(fs_info); > > > if (IS_ERR(seed_devices)) { > > > @@ -2985,8 +2983,6 @@ int btrfs_init_new_device(struct btrfs_fs_info > > > *fs_info, const char *device_path > > > mutex_unlock(&fs_info->chunk_mutex); > > > mutex_unlock(&fs_info->fs_devices->device_list_mutex); > > > error_trans: > > > - if (seeding_dev) > > > - btrfs_set_sb_rdonly(sb); > > > if (trans) > > > btrfs_end_transaction(trans); > > > error_free_zone: > > > > >
On 16/10/24 05:38, Boris Burkov wrote: > If you follow the seed/sprout wiki, it suggests the following workflow: > > btrfstune -S 1 seed_dev > mount seed_dev mnt > btrfs device add sprout_dev > mount -o remount,rw mnt > > The first mount mounts the FS readonly, which results in not setting > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add > somewhat surprisingly clears the readonly bit on the sb (though the > mount is still practically readonly, from the users perspective...). > Finally, the remount checks the readonly bit on the sb against the flag > and sees no change, so it does not run the code intended to run on > ro->rw transitions, leaving BTRFS_FS_OPEN unset. > > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and > does no work. This results in leaking deleted snapshots until we run out > of space. > > I propose fixing it at the first departure from what feels reasonable: > when we clear the readonly bit on the sb during device add. > > A new fstest I have written reproduces the bug and confirms the fix. > > Signed-off-by: Boris Burkov <boris@bur.io> > --- > Note that this is a resend of an old unmerged fix: > https://lore.kernel.org/linux-btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u > Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN > were also explored but not merged around that time: > https://lore.kernel.org/linux-btrfs/cover.1654216941.git.anand.jain@oracle.com/ > > I don't have a strong preference, but I would really like to see this > trivial bug fixed. For what it is worth, we have been carrying this > patch internally at Meta since I first sent it with no incident. > --- I remember fixing this before. I tested on 5.15, and the bug isn't there, but it’s back in 6.10, so something broke in between. We need to track it down. The original design (kernel 4.x and below) makes the filesystem switch to read-write mode after adding a sprout because: You can’t add a device to a normal read-only filesystem so with seed read-only mount is different. With a seed device, adding a writable device transforms it into a new read-write filesystem with a _new_ FSID and fs_devices. Logically, read-write at this stage makes sense, but I’m okay without it and in fact we had fixed this before, but a patch somewhere seems to have broken it again. (Demo below. :<x> is the return code from the 'run' command at https://github.com/asj/run.git) ----- 5.15.0-208.159.3.2.el9uek.x86_64 ---- $ mkfs.btrfs -fq /dev/loop0 :0 $ btrfstune -S1 /dev/loop0 :0 $ mount /dev/loop0 /btrfs :0 mount: /btrfs: WARNING: source write-protected, mounted read-only. $ cat /proc/self/mounts | grep btrfs :0 /dev/loop0 /btrfs btrfs ro,relatime,space_cache=v2,subvolid=5,subvol=/ 0 0 $ findmnt -o SOURCE,UUID /btrfs :0 SOURCE UUID /dev/loop0 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa $ btrfs fi show -m :0 Label: none uuid: 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa Total devices 1 FS bytes used 144.00KiB devid 1 size 3.00GiB used 536.00MiB path /dev/loop0 $ ls /sys/fs/btrfs :0 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa features $ btrfs dev add -f /dev/loop1 /btrfs :0 # After adding the device, the path and UUID are different, # so it’s a new filesystem. (But, as I said, I’m fine with # keeping it read-only and needing remount,rw. 
$ cat /proc/self/mounts | grep btrfs :0 /dev/loop1 /btrfs btrfs ro,relatime,space_cache=v2,subvolid=5,subvol=/ 0 0 $ findmnt -o SOURCE,UUID /btrfs :0 SOURCE UUID /dev/loop1 948cea35-18db-45da-9ec8-3d46cb5f0413 $ btrfs fi show -m :0 Label: none uuid: 948cea35-18db-45da-9ec8-3d46cb5f0413 Total devices 2 FS bytes used 144.00KiB devid 1 size 3.00GiB used 520.00MiB path /dev/loop0 devid 2 size 3.00GiB used 576.00MiB path /dev/loop1 $ ls /sys/fs/btrfs :0 948cea35-18db-45da-9ec8-3d46cb5f0413 features --------- Thanks, Anand > fs/btrfs/volumes.c | 4 ---- > 1 file changed, 4 deletions(-) > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index dc9f54849f39..84e861dcb350 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -2841,8 +2841,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path > set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE); > > if (seeding_dev) { > - btrfs_clear_sb_rdonly(sb); > - > /* GFP_KERNEL allocation must not be under device_list_mutex */ > seed_devices = btrfs_init_sprout(fs_info); > if (IS_ERR(seed_devices)) { > @@ -2985,8 +2983,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path > mutex_unlock(&fs_info->chunk_mutex); > mutex_unlock(&fs_info->fs_devices->device_list_mutex); > error_trans: > - if (seeding_dev) > - btrfs_set_sb_rdonly(sb); > if (trans) > btrfs_end_transaction(trans); > error_free_zone:
On Thu, Oct 17, 2024 at 01:14:16AM +0800, Anand Jain wrote: > On 16/10/24 05:38, Boris Burkov wrote: > > If you follow the seed/sprout wiki, it suggests the following workflow: > > > > btrfstune -S 1 seed_dev > > mount seed_dev mnt > > btrfs device add sprout_dev > > mount -o remount,rw mnt > > > > > > > The first mount mounts the FS readonly, which results in not setting > > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add > > somewhat surprisingly clears the readonly bit on the sb (though the > > mount is still practically readonly, from the users perspective...). > > Finally, the remount checks the readonly bit on the sb against the flag > > and sees no change, so it does not run the code intended to run on > > ro->rw transitions, leaving BTRFS_FS_OPEN unset. > > > > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and > > does no work. This results in leaking deleted snapshots until we run out > > of space. > > > > I propose fixing it at the first departure from what feels reasonable: > > when we clear the readonly bit on the sb during device add. > > > > A new fstest I have written reproduces the bug and confirms the fix. > > > > Signed-off-by: Boris Burkov <boris@bur.io> > > --- > > Note that this is a resend of an old unmerged fix: > > https://lore.kernel.org/linux-btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u > > Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN > > were also explored but not merged around that time: > > https://lore.kernel.org/linux-btrfs/cover.1654216941.git.anand.jain@oracle.com/ > > > > I don't have a strong preference, but I would really like to see this > > trivial bug fixed. For what it is worth, we have been carrying this > > patch internally at Meta since I first sent it with no incident. > > --- > > > I remember fixing this before. I tested on 5.15, and the bug isn't > there, but it’s back in 6.10, so something broke in between. > We need to track it down. Thanks for weighing in again, and for re-testing on 5.15. That's interesting that it broke again. And sorry if I didn't follow the rdonly check patches properly and those did end up getting merged. Poor code archaeology on my part :) At any rate, I have pushed this patch into for-next for now, as I think it constitutes an improvement without breaking any documented behavior. If you look into what happened between 5.15 and 6.11 and want to back it out with a different fix, I will not be offended. If we also land the fstest I submitted, then hopefully future kernels will *not* be breaking this again! Thanks, Boris > > The original design (kernel 4.x and below) makes the filesystem switch > to read-write mode after adding a sprout because: > > You can’t add a device to a normal read-only filesystem > so with seed read-only mount is different. > With a seed device, adding a writable device transforms > it into a new read-write filesystem with a _new_ FSID and > fs_devices. Logically, read-write at this stage makes sense, > but I’m okay without it and in fact we had fixed this before, > but a patch somewhere seems to have broken it again. > > > (Demo below. :<x> is the return code from the 'run' command at > https://github.com/asj/run.git) > > > ----- 5.15.0-208.159.3.2.el9uek.x86_64 ---- > $ mkfs.btrfs -fq /dev/loop0 :0 > $ btrfstune -S1 /dev/loop0 :0 > $ mount /dev/loop0 /btrfs :0 > mount: /btrfs: WARNING: source write-protected, mounted read-only. 
> > $ cat /proc/self/mounts | grep btrfs :0 > /dev/loop0 /btrfs btrfs ro,relatime,space_cache=v2,subvolid=5,subvol=/ 0 0 > > $ findmnt -o SOURCE,UUID /btrfs :0 > SOURCE UUID > /dev/loop0 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa > > $ btrfs fi show -m :0 > Label: none uuid: 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa > Total devices 1 FS bytes used 144.00KiB > devid 1 size 3.00GiB used 536.00MiB path /dev/loop0 > > $ ls /sys/fs/btrfs :0 > 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa > features > > $ btrfs dev add -f /dev/loop1 /btrfs :0 > > # After adding the device, the path and UUID are different, > # so it’s a new filesystem. (But, as I said, I’m fine with > # keeping it read-only and needing remount,rw. > > $ cat /proc/self/mounts | grep btrfs :0 > /dev/loop1 /btrfs btrfs ro,relatime,space_cache=v2,subvolid=5,subvol=/ 0 0 > > $ findmnt -o SOURCE,UUID /btrfs :0 > SOURCE UUID > /dev/loop1 948cea35-18db-45da-9ec8-3d46cb5f0413 > > $ btrfs fi show -m :0 > Label: none uuid: 948cea35-18db-45da-9ec8-3d46cb5f0413 > Total devices 2 FS bytes used 144.00KiB > devid 1 size 3.00GiB used 520.00MiB path /dev/loop0 > devid 2 size 3.00GiB used 576.00MiB path /dev/loop1 > > > $ ls /sys/fs/btrfs :0 > 948cea35-18db-45da-9ec8-3d46cb5f0413 > features > --------- > > > Thanks, Anand > > > fs/btrfs/volumes.c | 4 ---- > > 1 file changed, 4 deletions(-) > > > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > > index dc9f54849f39..84e861dcb350 100644 > > --- a/fs/btrfs/volumes.c > > +++ b/fs/btrfs/volumes.c > > @@ -2841,8 +2841,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path > > set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE); > > if (seeding_dev) { > > - btrfs_clear_sb_rdonly(sb); > > - > > /* GFP_KERNEL allocation must not be under device_list_mutex */ > > seed_devices = btrfs_init_sprout(fs_info); > > if (IS_ERR(seed_devices)) { > > @@ -2985,8 +2983,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path > > mutex_unlock(&fs_info->chunk_mutex); > > mutex_unlock(&fs_info->fs_devices->device_list_mutex); > > error_trans: > > - if (seeding_dev) > > - btrfs_set_sb_rdonly(sb); > > if (trans) > > btrfs_end_transaction(trans); > > error_free_zone: >
On Tue, Oct 15, 2024 at 02:38:34PM -0700, Boris Burkov wrote: > If you follow the seed/sprout wiki, it suggests the following workflow: > > btrfstune -S 1 seed_dev > mount seed_dev mnt > btrfs device add sprout_dev > mount -o remount,rw mnt > > The first mount mounts the FS readonly, which results in not setting > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add > somewhat surprisingly clears the readonly bit on the sb (though the > mount is still practically readonly, from the users perspective...). > Finally, the remount checks the readonly bit on the sb against the flag > and sees no change, so it does not run the code intended to run on > ro->rw transitions, leaving BTRFS_FS_OPEN unset. > > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and > does no work. This results in leaking deleted snapshots until we run out > of space. > > I propose fixing it at the first departure from what feels reasonable: > when we clear the readonly bit on the sb during device add. > > A new fstest I have written reproduces the bug and confirms the fix. > > Signed-off-by: Boris Burkov <boris@bur.io> > --- > Note that this is a resend of an old unmerged fix: > https://lore.kernel.org/linux-btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u > Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN > were also explored but not merged around that time: > https://lore.kernel.org/linux-btrfs/cover.1654216941.git.anand.jain@oracle.com/ > > I don't have a strong preference, but I would really like to see this > trivial bug fixed. For what it is worth, we have been carrying this > patch internally at Meta since I first sent it with no incident. We have an unresolved dispute about the fix and now the patch got to for-next within a few days because it got two reviews but not mine. The way you use it in Meta works for you but the fix is changing behaviour so we can either ignore everybody else relying on the old way or say that seeding is so niche that we don't care and see what we can do once we get some report.
On Thu, Oct 17, 2024 at 04:01:12PM +0200, David Sterba wrote: > On Tue, Oct 15, 2024 at 02:38:34PM -0700, Boris Burkov wrote: > > If you follow the seed/sprout wiki, it suggests the following workflow: > > > > btrfstune -S 1 seed_dev > > mount seed_dev mnt > > btrfs device add sprout_dev > > mount -o remount,rw mnt > > > > The first mount mounts the FS readonly, which results in not setting > > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add > > somewhat surprisingly clears the readonly bit on the sb (though the > > mount is still practically readonly, from the users perspective...). > > Finally, the remount checks the readonly bit on the sb against the flag > > and sees no change, so it does not run the code intended to run on > > ro->rw transitions, leaving BTRFS_FS_OPEN unset. > > > > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and > > does no work. This results in leaking deleted snapshots until we run out > > of space. > > > > I propose fixing it at the first departure from what feels reasonable: > > when we clear the readonly bit on the sb during device add. > > > > A new fstest I have written reproduces the bug and confirms the fix. > > > > Signed-off-by: Boris Burkov <boris@bur.io> > > --- > > Note that this is a resend of an old unmerged fix: > > https://lore.kernel.org/linux-btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u > > Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN > > were also explored but not merged around that time: > > https://lore.kernel.org/linux-btrfs/cover.1654216941.git.anand.jain@oracle.com/ > > > > I don't have a strong preference, but I would really like to see this > > trivial bug fixed. For what it is worth, we have been carrying this > > patch internally at Meta since I first sent it with no incident. > > We have an unresolved dispute about the fix and now the patch got to > for-next within a few days because it got two reviews but not mine. > The way you use it in Meta works for you but the fix is changing > behaviour so we can either ignore everybody else relying on the old > way or say that seeding is so niche that we don't care and see what we > can do once we get some report. Please feel free to remove it, and I am happy to review whatever other fix you think is best. Sorry for rushing, I just wanted to get it done and out of my head so I could move on to other issues. I assume your concern is that before this fix, the filesystem is actually read-write after the device add, which this patch breaks? My only argument against this is that the documentation has always said you needed to remount/cycle-mount after adding the sprout, so there is no fair reason to assume this would work. (In fact, it *doesn't*, the fs is once again in a unexpected, degraded, state...) But if existing LiveCD users are relying on this undocumented behavior, then I think you are right and it's a bad idea to break them. As long as my test (and presumably some fix) goes in and I don't have to carry this patch internally for two more years, then I am happy. Thanks, Boris
在 2024/10/17 03:44, Anand Jain 写道: > On 16/10/24 05:38, Boris Burkov wrote: >> If you follow the seed/sprout wiki, it suggests the following workflow: >> >> btrfstune -S 1 seed_dev >> mount seed_dev mnt >> btrfs device add sprout_dev >> mount -o remount,rw mnt >> > > > >> The first mount mounts the FS readonly, which results in not setting >> BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add >> somewhat surprisingly clears the readonly bit on the sb (though the >> mount is still practically readonly, from the users perspective...). >> Finally, the remount checks the readonly bit on the sb against the flag >> and sees no change, so it does not run the code intended to run on >> ro->rw transitions, leaving BTRFS_FS_OPEN unset. >> >> As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and >> does no work. This results in leaking deleted snapshots until we run out >> of space. >> >> I propose fixing it at the first departure from what feels reasonable: >> when we clear the readonly bit on the sb during device add. >> >> A new fstest I have written reproduces the bug and confirms the fix. >> >> Signed-off-by: Boris Burkov <boris@bur.io> >> --- >> Note that this is a resend of an old unmerged fix: >> https://lore.kernel.org/linux- >> btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u >> Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN >> were also explored but not merged around that time: >> https://lore.kernel.org/linux-btrfs/ >> cover.1654216941.git.anand.jain@oracle.com/ >> >> I don't have a strong preference, but I would really like to see this >> trivial bug fixed. For what it is worth, we have been carrying this >> patch internally at Meta since I first sent it with no incident. >> --- > > > I remember fixing this before. I tested on 5.15, and the bug isn't > there, but it’s back in 6.10, so something broke in between. > We need to track it down. > > The original design (kernel 4.x and below) makes the filesystem switch > to read-write mode after adding a sprout because: > > You can’t add a device to a normal read-only filesystem > so with seed read-only mount is different. > With a seed device, adding a writable device transforms > it into a new read-write filesystem with a _new_ FSID and > fs_devices. Logically, read-write at this stage makes sense, > but I’m okay without it and in fact we had fixed this before, > but a patch somewhere seems to have broken it again. > > > (Demo below. :<x> is the return code from the 'run' command at > https://github.com/asj/run.git) > > > ----- 5.15.0-208.159.3.2.el9uek.x86_64 ---- I also tried it on upstream kernel v5.15.94, the behavior is still the old changed to RW immediately after device add: [adam@btrfs-vm ~]$ uname -a Linux btrfs-vm 5.15.94-1-lts #1 SMP Wed, 15 Feb 2023 07:09:02 +0000 x86_64 GNU/Linux [adam@btrfs-vm ~]$ sudo mkfs.btrfs -f /dev/test/scratch1 > /dev/null [adam@btrfs-vm ~]$ sudo btrfstune -S 1 /dev/test/scratch1 [adam@btrfs-vm ~]$ sudo mount /dev/test/scratch1 /mnt/btrfs/ mount: /mnt/btrfs: WARNING: source write-protected, mounted read-only. [adam@btrfs-vm ~]$ sudo btrfs device add -f /dev/test/scratch2 /mnt/btrfs/ Performing full device TRIM /dev/test/scratch2 (10.00GiB) ... [adam@btrfs-vm ~]$ sudo touch /mnt/btrfs/file [adam@btrfs-vm ~]$ mount | grep mnt/btrfs /dev/mapper/test-scratch2 on /mnt/btrfs type btrfs (rw,relatime,space_cache=v2,subvolid=5,subvol=/) So it looks like it's some extra backports causing the behavior change. 
But I still strongly prefer to keep it RO. Even if it's a different fs under the hood, it still suddenly changes the RO/RW status of a mount point without letting the user to know. Thanks, Qu > $ mkfs.btrfs -fq /dev/loop0 :0 > $ btrfstune -S1 /dev/loop0 :0 > $ mount /dev/loop0 /btrfs :0 > mount: /btrfs: WARNING: source write-protected, mounted read-only. > > $ cat /proc/self/mounts | grep btrfs :0 > /dev/loop0 /btrfs btrfs ro,relatime,space_cache=v2,subvolid=5,subvol=/ 0 0 > > $ findmnt -o SOURCE,UUID /btrfs :0 > SOURCE UUID > /dev/loop0 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa > > $ btrfs fi show -m :0 > Label: none uuid: 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa > Total devices 1 FS bytes used 144.00KiB > devid 1 size 3.00GiB used 536.00MiB path /dev/loop0 > > $ ls /sys/fs/btrfs :0 > 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa > features > > $ btrfs dev add -f /dev/loop1 /btrfs :0 > > # After adding the device, the path and UUID are different, > # so it’s a new filesystem. (But, as I said, I’m fine with > # keeping it read-only and needing remount,rw. > > $ cat /proc/self/mounts | grep btrfs :0 > /dev/loop1 /btrfs btrfs ro,relatime,space_cache=v2,subvolid=5,subvol=/ 0 0 > > $ findmnt -o SOURCE,UUID /btrfs :0 > SOURCE UUID > /dev/loop1 948cea35-18db-45da-9ec8-3d46cb5f0413 > > $ btrfs fi show -m :0 > Label: none uuid: 948cea35-18db-45da-9ec8-3d46cb5f0413 > Total devices 2 FS bytes used 144.00KiB > devid 1 size 3.00GiB used 520.00MiB path /dev/loop0 > devid 2 size 3.00GiB used 576.00MiB path /dev/loop1 > > > $ ls /sys/fs/btrfs :0 > 948cea35-18db-45da-9ec8-3d46cb5f0413 > features > --------- > > > Thanks, Anand > >> fs/btrfs/volumes.c | 4 ---- >> 1 file changed, 4 deletions(-) >> >> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >> index dc9f54849f39..84e861dcb350 100644 >> --- a/fs/btrfs/volumes.c >> +++ b/fs/btrfs/volumes.c >> @@ -2841,8 +2841,6 @@ int btrfs_init_new_device(struct btrfs_fs_info >> *fs_info, const char *device_path >> set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE); >> if (seeding_dev) { >> - btrfs_clear_sb_rdonly(sb); >> - >> /* GFP_KERNEL allocation must not be under device_list_mutex */ >> seed_devices = btrfs_init_sprout(fs_info); >> if (IS_ERR(seed_devices)) { >> @@ -2985,8 +2983,6 @@ int btrfs_init_new_device(struct btrfs_fs_info >> *fs_info, const char *device_path >> mutex_unlock(&fs_info->chunk_mutex); >> mutex_unlock(&fs_info->fs_devices->device_list_mutex); >> error_trans: >> - if (seeding_dev) >> - btrfs_set_sb_rdonly(sb); >> if (trans) >> btrfs_end_transaction(trans); >> error_free_zone: > >
On 18/10/24 04:47, Qu Wenruo wrote: > > > 在 2024/10/17 03:44, Anand Jain 写道: >> On 16/10/24 05:38, Boris Burkov wrote: >>> If you follow the seed/sprout wiki, it suggests the following workflow: >>> >>> btrfstune -S 1 seed_dev >>> mount seed_dev mnt >>> btrfs device add sprout_dev >>> mount -o remount,rw mnt >>> >> >> >> >>> The first mount mounts the FS readonly, which results in not setting >>> BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add >>> somewhat surprisingly clears the readonly bit on the sb (though the >>> mount is still practically readonly, from the users perspective...). >>> Finally, the remount checks the readonly bit on the sb against the flag >>> and sees no change, so it does not run the code intended to run on >>> ro->rw transitions, leaving BTRFS_FS_OPEN unset. >>> >>> As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and >>> does no work. This results in leaking deleted snapshots until we run out >>> of space. >>> >>> I propose fixing it at the first departure from what feels reasonable: >>> when we clear the readonly bit on the sb during device add. >>> >>> A new fstest I have written reproduces the bug and confirms the fix. >>> >>> Signed-off-by: Boris Burkov <boris@bur.io> >>> --- >>> Note that this is a resend of an old unmerged fix: >>> https://lore.kernel.org/linux- >>> btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u >>> Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN >>> were also explored but not merged around that time: >>> https://lore.kernel.org/linux-btrfs/ >>> cover.1654216941.git.anand.jain@oracle.com/ >>> >>> I don't have a strong preference, but I would really like to see this >>> trivial bug fixed. For what it is worth, we have been carrying this >>> patch internally at Meta since I first sent it with no incident. >>> --- >> >> >> I remember fixing this before. I tested on 5.15, and the bug isn't >> there, but it’s back in 6.10, so something broke in between. >> We need to track it down. >> >> The original design (kernel 4.x and below) makes the filesystem switch >> to read-write mode after adding a sprout because: >> >> You can’t add a device to a normal read-only filesystem >> so with seed read-only mount is different. >> With a seed device, adding a writable device transforms >> it into a new read-write filesystem with a _new_ FSID and >> fs_devices. Logically, read-write at this stage makes sense, >> but I’m okay without it and in fact we had fixed this before, >> but a patch somewhere seems to have broken it again. >> >> >> (Demo below. :<x> is the return code from the 'run' command at >> https://github.com/asj/run.git) >> >> >> ----- 5.15.0-208.159.3.2.el9uek.x86_64 ---- > > I also tried it on upstream kernel v5.15.94, the behavior is still the > old changed to RW immediately after device add: > > [adam@btrfs-vm ~]$ uname -a > Linux btrfs-vm 5.15.94-1-lts #1 SMP Wed, 15 Feb 2023 07:09:02 +0000 > x86_64 GNU/Linux > [adam@btrfs-vm ~]$ sudo mkfs.btrfs -f /dev/test/scratch1 > /dev/null > [adam@btrfs-vm ~]$ sudo btrfstune -S 1 /dev/test/scratch1 > [adam@btrfs-vm ~]$ sudo mount /dev/test/scratch1 /mnt/btrfs/ > mount: /mnt/btrfs: WARNING: source write-protected, mounted read-only. > [adam@btrfs-vm ~]$ sudo btrfs device add -f /dev/test/scratch2 /mnt/btrfs/ > Performing full device TRIM /dev/test/scratch2 (10.00GiB) ... 
> [adam@btrfs-vm ~]$ sudo touch /mnt/btrfs/file > [adam@btrfs-vm ~]$ mount | grep mnt/btrfs > /dev/mapper/test-scratch2 on /mnt/btrfs type btrfs > (rw,relatime,space_cache=v2,subvolid=5,subvol=/) > > So it looks like it's some extra backports causing the behavior change. > Actually, it is caused by util-linux, specifically the libmount. v2.38 is good, but v2.39-rc1 is bad with the same kernel without the fix. Unfortunately, a bunch of libmount commits between these versions are not bisectable. So I have no specific commit but commits from 2b1db0951b9d to f07412a04ca8. > But I still strongly prefer to keep it RO. > Even if it's a different fs under the hood, it still suddenly changes > the RO/RW status of a mount point without letting the user to know. Just my perspective—typically, we add an RW device to make a seed filesystem writable. If one command can do it, that's great from ease of use pov. But I’m fine with RO; it’s cleaner. Thanks, Anand > > Thanks, > Qu > >> $ mkfs.btrfs -fq /dev/loop0 :0 >> $ btrfstune -S1 /dev/loop0 :0 >> $ mount /dev/loop0 /btrfs :0 >> mount: /btrfs: WARNING: source write-protected, mounted read-only. >> >> $ cat /proc/self/mounts | grep btrfs :0 >> /dev/loop0 /btrfs btrfs ro,relatime,space_cache=v2,subvolid=5,subvol=/ >> 0 0 >> >> $ findmnt -o SOURCE,UUID /btrfs :0 >> SOURCE UUID >> /dev/loop0 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa >> >> $ btrfs fi show -m :0 >> Label: none uuid: 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa >> Total devices 1 FS bytes used 144.00KiB >> devid 1 size 3.00GiB used 536.00MiB path /dev/loop0 >> >> $ ls /sys/fs/btrfs :0 >> 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa >> features >> >> $ btrfs dev add -f /dev/loop1 /btrfs :0 >> >> # After adding the device, the path and UUID are different, >> # so it’s a new filesystem. (But, as I said, I’m fine with >> # keeping it read-only and needing remount,rw. >> >> $ cat /proc/self/mounts | grep btrfs :0 >> /dev/loop1 /btrfs btrfs ro,relatime,space_cache=v2,subvolid=5,subvol=/ >> 0 0 >> >> $ findmnt -o SOURCE,UUID /btrfs :0 >> SOURCE UUID >> /dev/loop1 948cea35-18db-45da-9ec8-3d46cb5f0413 >> >> $ btrfs fi show -m :0 >> Label: none uuid: 948cea35-18db-45da-9ec8-3d46cb5f0413 >> Total devices 2 FS bytes used 144.00KiB >> devid 1 size 3.00GiB used 520.00MiB path /dev/loop0 >> devid 2 size 3.00GiB used 576.00MiB path /dev/loop1 >> >> >> $ ls /sys/fs/btrfs :0 >> 948cea35-18db-45da-9ec8-3d46cb5f0413 >> features >> --------- >> >> >> Thanks, Anand >> >>> fs/btrfs/volumes.c | 4 ---- >>> 1 file changed, 4 deletions(-) >>> >>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >>> index dc9f54849f39..84e861dcb350 100644 >>> --- a/fs/btrfs/volumes.c >>> +++ b/fs/btrfs/volumes.c >>> @@ -2841,8 +2841,6 @@ int btrfs_init_new_device(struct btrfs_fs_info >>> *fs_info, const char *device_path >>> set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE); >>> if (seeding_dev) { >>> - btrfs_clear_sb_rdonly(sb); >>> - >>> /* GFP_KERNEL allocation must not be under >>> device_list_mutex */ >>> seed_devices = btrfs_init_sprout(fs_info); >>> if (IS_ERR(seed_devices)) { >>> @@ -2985,8 +2983,6 @@ int btrfs_init_new_device(struct btrfs_fs_info >>> *fs_info, const char *device_path >>> mutex_unlock(&fs_info->chunk_mutex); >>> mutex_unlock(&fs_info->fs_devices->device_list_mutex); >>> error_trans: >>> - if (seeding_dev) >>> - btrfs_set_sb_rdonly(sb); >>> if (trans) >>> btrfs_end_transaction(trans); >>> error_free_zone: >> >> >
On Thu, Oct 17, 2024 at 09:41:59AM -0700, Boris Burkov wrote: > On Thu, Oct 17, 2024 at 04:01:12PM +0200, David Sterba wrote: > > On Tue, Oct 15, 2024 at 02:38:34PM -0700, Boris Burkov wrote: > > > If you follow the seed/sprout wiki, it suggests the following workflow: > > > > > > btrfstune -S 1 seed_dev > > > mount seed_dev mnt > > > btrfs device add sprout_dev > > > mount -o remount,rw mnt > > > > > > The first mount mounts the FS readonly, which results in not setting > > > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add > > > somewhat surprisingly clears the readonly bit on the sb (though the > > > mount is still practically readonly, from the users perspective...). > > > Finally, the remount checks the readonly bit on the sb against the flag > > > and sees no change, so it does not run the code intended to run on > > > ro->rw transitions, leaving BTRFS_FS_OPEN unset. > > > > > > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and > > > does no work. This results in leaking deleted snapshots until we run out > > > of space. > > > > > > I propose fixing it at the first departure from what feels reasonable: > > > when we clear the readonly bit on the sb during device add. > > > > > > A new fstest I have written reproduces the bug and confirms the fix. > > > > > > Signed-off-by: Boris Burkov <boris@bur.io> > > > --- > > > Note that this is a resend of an old unmerged fix: > > > https://lore.kernel.org/linux-btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u > > > Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN > > > were also explored but not merged around that time: > > > https://lore.kernel.org/linux-btrfs/cover.1654216941.git.anand.jain@oracle.com/ > > > > > > I don't have a strong preference, but I would really like to see this > > > trivial bug fixed. For what it is worth, we have been carrying this > > > patch internally at Meta since I first sent it with no incident. > > > > We have an unresolved dispute about the fix and now the patch got to > > for-next within a few days because it got two reviews but not mine. > > The way you use it in Meta works for you but the fix is changing > > behaviour so we can either ignore everybody else relying on the old > > way or say that seeding is so niche that we don't care and see what we > > can do once we get some report. > > Please feel free to remove it, and I am happy to review whatever other > fix you think is best. Sorry for rushing, I just wanted to get it done > and out of my head so I could move on to other issues. I'm concerned because change like that needs an announcement, documentation changes or eventually an optional change so both use cases are available. I haven't merged it since the last time you or somebody posted it because I don't see a way how to fix it without fixing the bug and not breaking the use case. > I assume your concern is that before this fix, the filesystem is actually > read-write after the device add, which this patch breaks? > > My only argument against this is that the documentation has always said > you needed to remount/cycle-mount after adding the sprout, so there is > no fair reason to assume this would work. (In fact, it *doesn't*, the fs > is once again in a unexpected, degraded, state...) > > But if existing LiveCD users are relying on this undocumented behavior, > then I think you are right and it's a bad idea to break them. 
The problem here is that we don't know how the feature is used; the documentation came much later than the feature. So I take it that people rely on code, not documentation, even if there's undocumented behaviour.
On Mon, Oct 21, 2024 at 08:56:51PM +0200, David Sterba wrote: > On Thu, Oct 17, 2024 at 09:41:59AM -0700, Boris Burkov wrote: > > On Thu, Oct 17, 2024 at 04:01:12PM +0200, David Sterba wrote: > > > On Tue, Oct 15, 2024 at 02:38:34PM -0700, Boris Burkov wrote: > > > > If you follow the seed/sprout wiki, it suggests the following workflow: > > > > > > > > btrfstune -S 1 seed_dev > > > > mount seed_dev mnt > > > > btrfs device add sprout_dev > > > > mount -o remount,rw mnt > > > > > > > > The first mount mounts the FS readonly, which results in not setting > > > > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add > > > > somewhat surprisingly clears the readonly bit on the sb (though the > > > > mount is still practically readonly, from the users perspective...). > > > > Finally, the remount checks the readonly bit on the sb against the flag > > > > and sees no change, so it does not run the code intended to run on > > > > ro->rw transitions, leaving BTRFS_FS_OPEN unset. > > > > > > > > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and > > > > does no work. This results in leaking deleted snapshots until we run out > > > > of space. > > > > > > > > I propose fixing it at the first departure from what feels reasonable: > > > > when we clear the readonly bit on the sb during device add. > > > > > > > > A new fstest I have written reproduces the bug and confirms the fix. > > > > > > > > Signed-off-by: Boris Burkov <boris@bur.io> > > > > --- > > > > Note that this is a resend of an old unmerged fix: > > > > https://lore.kernel.org/linux-btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u > > > > Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN > > > > were also explored but not merged around that time: > > > > https://lore.kernel.org/linux-btrfs/cover.1654216941.git.anand.jain@oracle.com/ > > > > > > > > I don't have a strong preference, but I would really like to see this > > > > trivial bug fixed. For what it is worth, we have been carrying this > > > > patch internally at Meta since I first sent it with no incident. > > > > > > We have an unresolved dispute about the fix and now the patch got to > > > for-next within a few days because it got two reviews but not mine. > > > The way you use it in Meta works for you but the fix is changing > > > behaviour so we can either ignore everybody else relying on the old > > > way or say that seeding is so niche that we don't care and see what we > > > can do once we get some report. > > > > Please feel free to remove it, and I am happy to review whatever other > > fix you think is best. Sorry for rushing, I just wanted to get it done > > and out of my head so I could move on to other issues. > > I'm concerned because change like that needs an announcement, > documentation changes or eventually an optional change so both use cases > are available. I haven't merged it since the last time you or somebody > posted it because I don't see a way how to fix it without fixing the bug > and not breaking the use case. > That's fair. However, I assume we agree we need some fix, somewhere. Anyone who is really currently relying on this and not cycle mounting after adding the device doesn't get a cleaner thread, which includes empty block group cleanup, autorelocation, deleting snapshots, and defrag. I think subvolume cleanup and block group cleanup are probably the worst to lose. 
So let's step back from me impulsively stuffing this into for-next and do
the right thing together :)

Thinking of our options, in no particular order:

1a. Just land the fix.
    It's not great if a lot of people rely on the fs "working" after device
    add. Some of them might quickly find the proper steps in the docs, but
    this might confuse/upset people.

1b. Land the fix but properly socialize it.
    We can also make btrfs device add notice that it is a seed sprout and
    add a loud message that you need to remount rw. If this requires
    skipping a merge or two while we socialize it to obvious or google-able
    users, that seems reasonable.

1c. Land the fix and make btrfs-progs do the remount after the device add.
    This is kind of weird, but it does preserve the expected behavior. We
    could also add some seed-sprout-specific flags or commands as the
    "main" way to do it.

2. Have the device add detect the seed sprout case and really make the fs
   read-write.
   If we are committed to the behavior, and don't believe in a
   userspace-only fix, then it could be possible to make the current path,
   which simply clears the ro bit, also invoke some or all of the remount
   rw logic.

3. Do nothing (remove the fix from for-next).
   I keep carrying an internal patch forever, and upstream stays in danger
   of running an fs without a cleaner. If no one runs a long-lived fs off a
   live CD for long enough to care, maybe it is ok, too. But stuff like
   this is out there on Google, for example:
   https://blog.elastocloud.org/2022/11/btrfs-seed-devices-for-ab-system-updates.html

That's all I can think of, but I'm definitely open to other ideas!

Thanks,
Boris

> > I assume your concern is that before this fix, the filesystem is actually
> > read-write after the device add, which this patch breaks?
> >
> > My only argument against this is that the documentation has always said
> > you needed to remount/cycle-mount after adding the sprout, so there is
> > no fair reason to assume this would work. (In fact, it *doesn't*, the fs
> > is once again in a unexpected, degraded, state...)
> >
> > But if existing LiveCD users are relying on this undocumented behavior,
> > then I think you are right and it's a bad idea to break them.
>
> The problem here is that we don't know how the feature is used, the
> documentation came much later than the feature. So I take it as that
> people rely on code, not documenation, even if there's an undocumented
> behaviour.
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index dc9f54849f39..84e861dcb350 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2841,8 +2841,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE);
 
 	if (seeding_dev) {
-		btrfs_clear_sb_rdonly(sb);
-
 		/* GFP_KERNEL allocation must not be under device_list_mutex */
 		seed_devices = btrfs_init_sprout(fs_info);
 		if (IS_ERR(seed_devices)) {
@@ -2985,8 +2983,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	mutex_unlock(&fs_info->chunk_mutex);
 	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 error_trans:
-	if (seeding_dev)
-		btrfs_set_sb_rdonly(sb);
 	if (trans)
 		btrfs_end_transaction(trans);
 error_free_zone:
If you follow the seed/sprout wiki, it suggests the following workflow:

  btrfstune -S 1 seed_dev
  mount seed_dev mnt
  btrfs device add sprout_dev
  mount -o remount,rw mnt

The first mount mounts the FS readonly, which results in not setting
BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
somewhat surprisingly clears the readonly bit on the sb (though the
mount is still practically readonly, from the users perspective...).
Finally, the remount checks the readonly bit on the sb against the flag
and sees no change, so it does not run the code intended to run on
ro->rw transitions, leaving BTRFS_FS_OPEN unset.

As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
does no work. This results in leaking deleted snapshots until we run out
of space.

I propose fixing it at the first departure from what feels reasonable:
when we clear the readonly bit on the sb during device add.

A new fstest I have written reproduces the bug and confirms the fix.

Signed-off-by: Boris Burkov <boris@bur.io>
---
Note that this is a resend of an old unmerged fix:
https://lore.kernel.org/linux-btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u
Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN
were also explored but not merged around that time:
https://lore.kernel.org/linux-btrfs/cover.1654216941.git.anand.jain@oracle.com/

I don't have a strong preference, but I would really like to see this
trivial bug fixed. For what it is worth, we have been carrying this
patch internally at Meta since I first sent it with no incident.
---
 fs/btrfs/volumes.c | 4 ----
 1 file changed, 4 deletions(-)
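To make the reported symptom concrete, here is a similarly hypothetical, simplified sketch of the consequence described in the commit message. It is not the real cleaner_kthread; the names (ex_fs_info, fs_open, dead_snapshots, ex_cleaner_pass) are illustrative stand-ins. While the "open" bit stays unset, each cleaner pass bails out immediately, so deferred work such as dropping deleted snapshots only accumulates:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for fs_info state; names are illustrative only. */
struct ex_fs_info {
	bool fs_open;              /* stand-in for testing the BTRFS_FS_OPEN bit */
	int dead_snapshots;        /* queued snapshot cleanups */
};

/* One pass of a simplified "cleaner": it does nothing until the fs is open. */
static void ex_cleaner_pass(struct ex_fs_info *fs)
{
	if (!fs->fs_open)
		return;            /* the reported bug: always taken, work leaks */

	while (fs->dead_snapshots > 0)
		fs->dead_snapshots--;      /* drop one deleted snapshot */
}

int main(void)
{
	struct ex_fs_info fs = { .fs_open = false, .dead_snapshots = 3 };

	ex_cleaner_pass(&fs);
	printf("pending snapshot cleanups: %d\n", fs.dead_snapshots); /* still 3 */

	fs.fs_open = true;         /* what a proper ro->rw transition would enable */
	ex_cleaner_pass(&fs);
	printf("pending snapshot cleanups: %d\n", fs.dead_snapshots); /* now 0 */
	return 0;
}
```

This is why the thread treats the missing ro->rw transition as a space leak: until something sets the "open" state, deleted snapshots and similar deferred cleanups are never processed.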