diff mbox series

btrfs: do not clear read-only when adding sprout device

Message ID a9aa42f6bc2739ab46ce015f749e15177f8730d6.1729028033.git.boris@bur.io (mailing list archive)
State New, archived
Headers show
Series btrfs: do not clear read-only when adding sprout device | expand

Commit Message

Boris Burkov Oct. 15, 2024, 9:38 p.m. UTC
If you follow the seed/sprout wiki, it suggests the following workflow:

btrfstune -S 1 seed_dev
mount seed_dev mnt
btrfs device add sprout_dev
mount -o remount,rw mnt

The first mount mounts the FS readonly, which results in not setting
BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
somewhat surprisingly clears the readonly bit on the sb (though the
mount is still practically readonly, from the users perspective...).
Finally, the remount checks the readonly bit on the sb against the flag
and sees no change, so it does not run the code intended to run on
ro->rw transitions, leaving BTRFS_FS_OPEN unset.

As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
does no work. This results in leaking deleted snapshots until we run out
of space.

I propose fixing it at the first departure from what feels reasonable:
when we clear the readonly bit on the sb during device add.

A new fstest I have written reproduces the bug and confirms the fix.

Signed-off-by: Boris Burkov <boris@bur.io>
---
Note that this is a resend of an old unmerged fix:
https://lore.kernel.org/linux-btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u
Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN
were also explored but not merged around that time:
https://lore.kernel.org/linux-btrfs/cover.1654216941.git.anand.jain@oracle.com/

I don't have a strong preference, but I would really like to see this
trivial bug fixed. For what it is worth, we have been carrying this
patch internally at Meta since I first sent it with no incident.
---
 fs/btrfs/volumes.c | 4 ----
 1 file changed, 4 deletions(-)

Comments

Qu Wenruo Oct. 15, 2024, 10 p.m. UTC | #1
在 2024/10/16 08:08, Boris Burkov 写道:
> If you follow the seed/sprout wiki, it suggests the following workflow:
>
> btrfstune -S 1 seed_dev
> mount seed_dev mnt
> btrfs device add sprout_dev
> mount -o remount,rw mnt
>
> The first mount mounts the FS readonly, which results in not setting
> BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
> somewhat surprisingly clears the readonly bit on the sb (though the
> mount is still practically readonly, from the users perspective...).
> Finally, the remount checks the readonly bit on the sb against the flag
> and sees no change, so it does not run the code intended to run on
> ro->rw transitions, leaving BTRFS_FS_OPEN unset.
>
> As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
> does no work. This results in leaking deleted snapshots until we run out
> of space.
>
> I propose fixing it at the first departure from what feels reasonable:
> when we clear the readonly bit on the sb during device add.
>
> A new fstest I have written reproduces the bug and confirms the fix.
>
> Signed-off-by: Boris Burkov <boris@bur.io>

The fix looks good to me, small and keeps the super block ro flag
consistent.

IIRC the old behavior of sprout is, adding device will immediately mark
the fs RW, which is a big surprise changing the RO/RW status.

So the extra Rw remount requirement looks very reasonable to me.

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu

> ---
> Note that this is a resend of an old unmerged fix:
> https://lore.kernel.org/linux-btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u
> Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN
> were also explored but not merged around that time:
> https://lore.kernel.org/linux-btrfs/cover.1654216941.git.anand.jain@oracle.com/
>
> I don't have a strong preference, but I would really like to see this
> trivial bug fixed. For what it is worth, we have been carrying this
> patch internally at Meta since I first sent it with no incident.
> ---
>   fs/btrfs/volumes.c | 4 ----
>   1 file changed, 4 deletions(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index dc9f54849f39..84e861dcb350 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2841,8 +2841,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>   	set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE);
>
>   	if (seeding_dev) {
> -		btrfs_clear_sb_rdonly(sb);
> -
>   		/* GFP_KERNEL allocation must not be under device_list_mutex */
>   		seed_devices = btrfs_init_sprout(fs_info);
>   		if (IS_ERR(seed_devices)) {
> @@ -2985,8 +2983,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>   	mutex_unlock(&fs_info->chunk_mutex);
>   	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
>   error_trans:
> -	if (seeding_dev)
> -		btrfs_set_sb_rdonly(sb);
>   	if (trans)
>   		btrfs_end_transaction(trans);
>   error_free_zone:
Qu Wenruo Oct. 15, 2024, 10:12 p.m. UTC | #2
在 2024/10/16 08:30, Qu Wenruo 写道:
>
>
> 在 2024/10/16 08:08, Boris Burkov 写道:
>> If you follow the seed/sprout wiki, it suggests the following workflow:
>>
>> btrfstune -S 1 seed_dev
>> mount seed_dev mnt
>> btrfs device add sprout_dev
>> mount -o remount,rw mnt
>>
>> The first mount mounts the FS readonly, which results in not setting
>> BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
>> somewhat surprisingly clears the readonly bit on the sb (though the
>> mount is still practically readonly, from the users perspective...).
>> Finally, the remount checks the readonly bit on the sb against the flag
>> and sees no change, so it does not run the code intended to run on
>> ro->rw transitions, leaving BTRFS_FS_OPEN unset.
>>
>> As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
>> does no work. This results in leaking deleted snapshots until we run out
>> of space.
>>
>> I propose fixing it at the first departure from what feels reasonable:
>> when we clear the readonly bit on the sb during device add.
>>
>> A new fstest I have written reproduces the bug and confirms the fix.
>>
>> Signed-off-by: Boris Burkov <boris@bur.io>
>
> The fix looks good to me, small and keeps the super block ro flag
> consistent.
>
> IIRC the old behavior of sprout is, adding device will immediately mark
> the fs RW, which is a big surprise changing the RO/RW status.
>
> So the extra Rw remount requirement looks very reasonable to me.

Forgot to mention, although it's a trivial change in the behavior, if we
are determined to go this path, the man page of the "SEEDING DEVICE"
chapter also need to be updated.

Thanks,
Qu
>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
>
> Thanks,
> Qu
>
>> ---
>> Note that this is a resend of an old unmerged fix:
>> https://lore.kernel.org/linux-
>> btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u
>> Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN
>> were also explored but not merged around that time:
>> https://lore.kernel.org/linux-btrfs/
>> cover.1654216941.git.anand.jain@oracle.com/
>>
>> I don't have a strong preference, but I would really like to see this
>> trivial bug fixed. For what it is worth, we have been carrying this
>> patch internally at Meta since I first sent it with no incident.
>> ---
>>   fs/btrfs/volumes.c | 4 ----
>>   1 file changed, 4 deletions(-)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index dc9f54849f39..84e861dcb350 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -2841,8 +2841,6 @@ int btrfs_init_new_device(struct btrfs_fs_info
>> *fs_info, const char *device_path
>>       set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE);
>>
>>       if (seeding_dev) {
>> -        btrfs_clear_sb_rdonly(sb);
>> -
>>           /* GFP_KERNEL allocation must not be under device_list_mutex */
>>           seed_devices = btrfs_init_sprout(fs_info);
>>           if (IS_ERR(seed_devices)) {
>> @@ -2985,8 +2983,6 @@ int btrfs_init_new_device(struct btrfs_fs_info
>> *fs_info, const char *device_path
>>       mutex_unlock(&fs_info->chunk_mutex);
>>       mutex_unlock(&fs_info->fs_devices->device_list_mutex);
>>   error_trans:
>> -    if (seeding_dev)
>> -        btrfs_set_sb_rdonly(sb);
>>       if (trans)
>>           btrfs_end_transaction(trans);
>>   error_free_zone:
>
>
Boris Burkov Oct. 15, 2024, 11:23 p.m. UTC | #3
On Wed, Oct 16, 2024 at 08:42:14AM +1030, Qu Wenruo wrote:
> 
> 
> 在 2024/10/16 08:30, Qu Wenruo 写道:
> > 
> > 
> > 在 2024/10/16 08:08, Boris Burkov 写道:
> > > If you follow the seed/sprout wiki, it suggests the following workflow:
> > > 
> > > btrfstune -S 1 seed_dev
> > > mount seed_dev mnt
> > > btrfs device add sprout_dev
> > > mount -o remount,rw mnt
> > > 
> > > The first mount mounts the FS readonly, which results in not setting
> > > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
> > > somewhat surprisingly clears the readonly bit on the sb (though the
> > > mount is still practically readonly, from the users perspective...).
> > > Finally, the remount checks the readonly bit on the sb against the flag
> > > and sees no change, so it does not run the code intended to run on
> > > ro->rw transitions, leaving BTRFS_FS_OPEN unset.
> > > 
> > > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
> > > does no work. This results in leaking deleted snapshots until we run out
> > > of space.
> > > 
> > > I propose fixing it at the first departure from what feels reasonable:
> > > when we clear the readonly bit on the sb during device add.
> > > 
> > > A new fstest I have written reproduces the bug and confirms the fix.
> > > 
> > > Signed-off-by: Boris Burkov <boris@bur.io>
> > 
> > The fix looks good to me, small and keeps the super block ro flag
> > consistent.
> > 
> > IIRC the old behavior of sprout is, adding device will immediately mark
> > the fs RW, which is a big surprise changing the RO/RW status.
> > 
> > So the extra Rw remount requirement looks very reasonable to me.
> 
> Forgot to mention, although it's a trivial change in the behavior, if we
> are determined to go this path, the man page of the "SEEDING DEVICE"
> chapter also need to be updated.

I just checked the man page and everything there looks correct, at least
to me. It quite clearly states that after the 'device add' the fs is
ready to be remounted read-write (via cycle mount or remount).

Please let me know if there is some specific update you want me to make
that I missed, though!

BTW, this patch does change the rdonly flag behavior, so I will update
the test to look at that, as you suggested.

> 
> Thanks,
> Qu
> > 
> > Reviewed-by: Qu Wenruo <wqu@suse.com>
> > 
> > Thanks,
> > Qu
> > 
> > > ---
> > > Note that this is a resend of an old unmerged fix:
> > > https://lore.kernel.org/linux-
> > > btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u
> > > Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN
> > > were also explored but not merged around that time:
> > > https://lore.kernel.org/linux-btrfs/
> > > cover.1654216941.git.anand.jain@oracle.com/
> > > 
> > > I don't have a strong preference, but I would really like to see this
> > > trivial bug fixed. For what it is worth, we have been carrying this
> > > patch internally at Meta since I first sent it with no incident.
> > > ---
> > >   fs/btrfs/volumes.c | 4 ----
> > >   1 file changed, 4 deletions(-)
> > > 
> > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > > index dc9f54849f39..84e861dcb350 100644
> > > --- a/fs/btrfs/volumes.c
> > > +++ b/fs/btrfs/volumes.c
> > > @@ -2841,8 +2841,6 @@ int btrfs_init_new_device(struct btrfs_fs_info
> > > *fs_info, const char *device_path
> > >       set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE);
> > > 
> > >       if (seeding_dev) {
> > > -        btrfs_clear_sb_rdonly(sb);
> > > -
> > >           /* GFP_KERNEL allocation must not be under device_list_mutex */
> > >           seed_devices = btrfs_init_sprout(fs_info);
> > >           if (IS_ERR(seed_devices)) {
> > > @@ -2985,8 +2983,6 @@ int btrfs_init_new_device(struct btrfs_fs_info
> > > *fs_info, const char *device_path
> > >       mutex_unlock(&fs_info->chunk_mutex);
> > >       mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> > >   error_trans:
> > > -    if (seeding_dev)
> > > -        btrfs_set_sb_rdonly(sb);
> > >       if (trans)
> > >           btrfs_end_transaction(trans);
> > >   error_free_zone:
> > 
> > 
>
Anand Jain Oct. 16, 2024, 5:14 p.m. UTC | #4
On 16/10/24 05:38, Boris Burkov wrote:
> If you follow the seed/sprout wiki, it suggests the following workflow:
> 
> btrfstune -S 1 seed_dev
> mount seed_dev mnt
> btrfs device add sprout_dev
> mount -o remount,rw mnt
> 



> The first mount mounts the FS readonly, which results in not setting
> BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
> somewhat surprisingly clears the readonly bit on the sb (though the
> mount is still practically readonly, from the users perspective...).
> Finally, the remount checks the readonly bit on the sb against the flag
> and sees no change, so it does not run the code intended to run on
> ro->rw transitions, leaving BTRFS_FS_OPEN unset.
> 
> As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
> does no work. This results in leaking deleted snapshots until we run out
> of space.
> 
> I propose fixing it at the first departure from what feels reasonable:
> when we clear the readonly bit on the sb during device add.
> 
> A new fstest I have written reproduces the bug and confirms the fix.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
> Note that this is a resend of an old unmerged fix:
> https://lore.kernel.org/linux-btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u
> Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN
> were also explored but not merged around that time:
> https://lore.kernel.org/linux-btrfs/cover.1654216941.git.anand.jain@oracle.com/
> 
> I don't have a strong preference, but I would really like to see this
> trivial bug fixed. For what it is worth, we have been carrying this
> patch internally at Meta since I first sent it with no incident.
> ---


I remember fixing this before. I tested on 5.15, and the bug isn't
there, but it’s back in 6.10, so something broke in between.
We need to track it down.

The original design (kernel 4.x and below) makes the filesystem switch
to read-write mode after adding a sprout because:

    You can’t add a device to a normal read-only filesystem
  so with seed read-only mount is different.
    With a seed device, adding a writable device transforms
  it into a new read-write filesystem with a _new_ FSID and
  fs_devices. Logically, read-write at this stage makes sense,
  but I’m okay without it and in fact we had fixed this before,
  but a patch somewhere seems to have broken it again.


(Demo below. :<x> is the return code from the 'run' command at
  https://github.com/asj/run.git)


----- 5.15.0-208.159.3.2.el9uek.x86_64 ----
$ mkfs.btrfs -fq /dev/loop0 :0
$ btrfstune -S1 /dev/loop0 :0
$ mount /dev/loop0 /btrfs :0
mount: /btrfs: WARNING: source write-protected, mounted read-only.

$ cat /proc/self/mounts | grep btrfs :0
/dev/loop0 /btrfs btrfs ro,relatime,space_cache=v2,subvolid=5,subvol=/ 0 0

$ findmnt -o SOURCE,UUID /btrfs :0
SOURCE     UUID
/dev/loop0 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa

$ btrfs fi show -m :0
Label: none  uuid: 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa
	Total devices 1 FS bytes used 144.00KiB
	devid    1 size 3.00GiB used 536.00MiB path /dev/loop0

$ ls /sys/fs/btrfs :0
64f21b87-4e4c-4786-b2cd-c09a5ccd2afa
features

$ btrfs dev add -f /dev/loop1 /btrfs :0

# After adding the device, the path and UUID are different,
# so it’s a new filesystem. (But, as I said, I’m fine with
# keeping it read-only and needing remount,rw.

$ cat /proc/self/mounts | grep btrfs :0
/dev/loop1 /btrfs btrfs ro,relatime,space_cache=v2,subvolid=5,subvol=/ 0 0

$ findmnt -o SOURCE,UUID /btrfs :0
SOURCE     UUID
/dev/loop1 948cea35-18db-45da-9ec8-3d46cb5f0413

$ btrfs fi show -m :0
Label: none  uuid: 948cea35-18db-45da-9ec8-3d46cb5f0413
	Total devices 2 FS bytes used 144.00KiB
	devid    1 size 3.00GiB used 520.00MiB path /dev/loop0
	devid    2 size 3.00GiB used 576.00MiB path /dev/loop1


$ ls /sys/fs/btrfs :0
948cea35-18db-45da-9ec8-3d46cb5f0413
features
---------


Thanks, Anand

>   fs/btrfs/volumes.c | 4 ----
>   1 file changed, 4 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index dc9f54849f39..84e861dcb350 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2841,8 +2841,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>   	set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE);
>   
>   	if (seeding_dev) {
> -		btrfs_clear_sb_rdonly(sb);
> -
>   		/* GFP_KERNEL allocation must not be under device_list_mutex */
>   		seed_devices = btrfs_init_sprout(fs_info);
>   		if (IS_ERR(seed_devices)) {
> @@ -2985,8 +2983,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>   	mutex_unlock(&fs_info->chunk_mutex);
>   	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
>   error_trans:
> -	if (seeding_dev)
> -		btrfs_set_sb_rdonly(sb);
>   	if (trans)
>   		btrfs_end_transaction(trans);
>   error_free_zone:
Boris Burkov Oct. 16, 2024, 5:24 p.m. UTC | #5
On Thu, Oct 17, 2024 at 01:14:16AM +0800, Anand Jain wrote:
> On 16/10/24 05:38, Boris Burkov wrote:
> > If you follow the seed/sprout wiki, it suggests the following workflow:
> > 
> > btrfstune -S 1 seed_dev
> > mount seed_dev mnt
> > btrfs device add sprout_dev
> > mount -o remount,rw mnt
> > 
> 
> 
> 
> > The first mount mounts the FS readonly, which results in not setting
> > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
> > somewhat surprisingly clears the readonly bit on the sb (though the
> > mount is still practically readonly, from the users perspective...).
> > Finally, the remount checks the readonly bit on the sb against the flag
> > and sees no change, so it does not run the code intended to run on
> > ro->rw transitions, leaving BTRFS_FS_OPEN unset.
> > 
> > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
> > does no work. This results in leaking deleted snapshots until we run out
> > of space.
> > 
> > I propose fixing it at the first departure from what feels reasonable:
> > when we clear the readonly bit on the sb during device add.
> > 
> > A new fstest I have written reproduces the bug and confirms the fix.
> > 
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> > Note that this is a resend of an old unmerged fix:
> > https://lore.kernel.org/linux-btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u
> > Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN
> > were also explored but not merged around that time:
> > https://lore.kernel.org/linux-btrfs/cover.1654216941.git.anand.jain@oracle.com/
> > 
> > I don't have a strong preference, but I would really like to see this
> > trivial bug fixed. For what it is worth, we have been carrying this
> > patch internally at Meta since I first sent it with no incident.
> > ---
> 
> 
> I remember fixing this before. I tested on 5.15, and the bug isn't
> there, but it’s back in 6.10, so something broke in between.
> We need to track it down.

Thanks for weighing in again, and for re-testing on 5.15. That's
interesting that it broke again. And sorry if I didn't follow the rdonly
check patches properly and those did end up getting merged. Poor code
archaeology on my part :)

At any rate, I have pushed this patch into for-next for now, as I
think it constitutes an improvement without breaking any documented
behavior. If you look into what happened between 5.15 and 6.11 and want
to back it out with a different fix, I will not be offended.

If we also land the fstest I submitted, then hopefully future kernels
will *not* be breaking this again!

Thanks,
Boris

> 
> The original design (kernel 4.x and below) makes the filesystem switch
> to read-write mode after adding a sprout because:
> 
>    You can’t add a device to a normal read-only filesystem
>  so with seed read-only mount is different.
>    With a seed device, adding a writable device transforms
>  it into a new read-write filesystem with a _new_ FSID and
>  fs_devices. Logically, read-write at this stage makes sense,
>  but I’m okay without it and in fact we had fixed this before,
>  but a patch somewhere seems to have broken it again.
> 
> 
> (Demo below. :<x> is the return code from the 'run' command at
>  https://github.com/asj/run.git)
> 
> 
> ----- 5.15.0-208.159.3.2.el9uek.x86_64 ----
> $ mkfs.btrfs -fq /dev/loop0 :0
> $ btrfstune -S1 /dev/loop0 :0
> $ mount /dev/loop0 /btrfs :0
> mount: /btrfs: WARNING: source write-protected, mounted read-only.
> 
> $ cat /proc/self/mounts | grep btrfs :0
> /dev/loop0 /btrfs btrfs ro,relatime,space_cache=v2,subvolid=5,subvol=/ 0 0
> 
> $ findmnt -o SOURCE,UUID /btrfs :0
> SOURCE     UUID
> /dev/loop0 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa
> 
> $ btrfs fi show -m :0
> Label: none  uuid: 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa
> 	Total devices 1 FS bytes used 144.00KiB
> 	devid    1 size 3.00GiB used 536.00MiB path /dev/loop0
> 
> $ ls /sys/fs/btrfs :0
> 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa
> features
> 
> $ btrfs dev add -f /dev/loop1 /btrfs :0
> 
> # After adding the device, the path and UUID are different,
> # so it’s a new filesystem. (But, as I said, I’m fine with
> # keeping it read-only and needing remount,rw.
> 
> $ cat /proc/self/mounts | grep btrfs :0
> /dev/loop1 /btrfs btrfs ro,relatime,space_cache=v2,subvolid=5,subvol=/ 0 0
> 
> $ findmnt -o SOURCE,UUID /btrfs :0
> SOURCE     UUID
> /dev/loop1 948cea35-18db-45da-9ec8-3d46cb5f0413
> 
> $ btrfs fi show -m :0
> Label: none  uuid: 948cea35-18db-45da-9ec8-3d46cb5f0413
> 	Total devices 2 FS bytes used 144.00KiB
> 	devid    1 size 3.00GiB used 520.00MiB path /dev/loop0
> 	devid    2 size 3.00GiB used 576.00MiB path /dev/loop1
> 
> 
> $ ls /sys/fs/btrfs :0
> 948cea35-18db-45da-9ec8-3d46cb5f0413
> features
> ---------
> 
> 
> Thanks, Anand
> 
> >   fs/btrfs/volumes.c | 4 ----
> >   1 file changed, 4 deletions(-)
> > 
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index dc9f54849f39..84e861dcb350 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -2841,8 +2841,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
> >   	set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE);
> >   	if (seeding_dev) {
> > -		btrfs_clear_sb_rdonly(sb);
> > -
> >   		/* GFP_KERNEL allocation must not be under device_list_mutex */
> >   		seed_devices = btrfs_init_sprout(fs_info);
> >   		if (IS_ERR(seed_devices)) {
> > @@ -2985,8 +2983,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
> >   	mutex_unlock(&fs_info->chunk_mutex);
> >   	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> >   error_trans:
> > -	if (seeding_dev)
> > -		btrfs_set_sb_rdonly(sb);
> >   	if (trans)
> >   		btrfs_end_transaction(trans);
> >   error_free_zone:
>
David Sterba Oct. 17, 2024, 2:01 p.m. UTC | #6
On Tue, Oct 15, 2024 at 02:38:34PM -0700, Boris Burkov wrote:
> If you follow the seed/sprout wiki, it suggests the following workflow:
> 
> btrfstune -S 1 seed_dev
> mount seed_dev mnt
> btrfs device add sprout_dev
> mount -o remount,rw mnt
> 
> The first mount mounts the FS readonly, which results in not setting
> BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
> somewhat surprisingly clears the readonly bit on the sb (though the
> mount is still practically readonly, from the users perspective...).
> Finally, the remount checks the readonly bit on the sb against the flag
> and sees no change, so it does not run the code intended to run on
> ro->rw transitions, leaving BTRFS_FS_OPEN unset.
> 
> As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
> does no work. This results in leaking deleted snapshots until we run out
> of space.
> 
> I propose fixing it at the first departure from what feels reasonable:
> when we clear the readonly bit on the sb during device add.
> 
> A new fstest I have written reproduces the bug and confirms the fix.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
> Note that this is a resend of an old unmerged fix:
> https://lore.kernel.org/linux-btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u
> Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN
> were also explored but not merged around that time:
> https://lore.kernel.org/linux-btrfs/cover.1654216941.git.anand.jain@oracle.com/
> 
> I don't have a strong preference, but I would really like to see this
> trivial bug fixed. For what it is worth, we have been carrying this
> patch internally at Meta since I first sent it with no incident.

We have an unresolved dispute about the fix and now the patch got to
for-next within a few days because it got two reviews but not mine.
The way you use it in Meta works for you but the fix is changing
behaviour so we can either ignore everybody else relying on the old
way or say that seeding is so niche that we don't care and see what we
can do once we get some report.
Boris Burkov Oct. 17, 2024, 4:41 p.m. UTC | #7
On Thu, Oct 17, 2024 at 04:01:12PM +0200, David Sterba wrote:
> On Tue, Oct 15, 2024 at 02:38:34PM -0700, Boris Burkov wrote:
> > If you follow the seed/sprout wiki, it suggests the following workflow:
> > 
> > btrfstune -S 1 seed_dev
> > mount seed_dev mnt
> > btrfs device add sprout_dev
> > mount -o remount,rw mnt
> > 
> > The first mount mounts the FS readonly, which results in not setting
> > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
> > somewhat surprisingly clears the readonly bit on the sb (though the
> > mount is still practically readonly, from the users perspective...).
> > Finally, the remount checks the readonly bit on the sb against the flag
> > and sees no change, so it does not run the code intended to run on
> > ro->rw transitions, leaving BTRFS_FS_OPEN unset.
> > 
> > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
> > does no work. This results in leaking deleted snapshots until we run out
> > of space.
> > 
> > I propose fixing it at the first departure from what feels reasonable:
> > when we clear the readonly bit on the sb during device add.
> > 
> > A new fstest I have written reproduces the bug and confirms the fix.
> > 
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> > Note that this is a resend of an old unmerged fix:
> > https://lore.kernel.org/linux-btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u
> > Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN
> > were also explored but not merged around that time:
> > https://lore.kernel.org/linux-btrfs/cover.1654216941.git.anand.jain@oracle.com/
> > 
> > I don't have a strong preference, but I would really like to see this
> > trivial bug fixed. For what it is worth, we have been carrying this
> > patch internally at Meta since I first sent it with no incident.
> 
> We have an unresolved dispute about the fix and now the patch got to
> for-next within a few days because it got two reviews but not mine.
> The way you use it in Meta works for you but the fix is changing
> behaviour so we can either ignore everybody else relying on the old
> way or say that seeding is so niche that we don't care and see what we
> can do once we get some report.

Please feel free to remove it, and I am happy to review whatever other
fix you think is best. Sorry for rushing, I just wanted to get it done
and out of my head so I could move on to other issues.

I assume your concern is that before this fix, the filesystem is actually
read-write after the device add, which this patch breaks?

My only argument against this is that the documentation has always said
you needed to remount/cycle-mount after adding the sprout, so there is
no fair reason to assume this would work. (In fact, it *doesn't*, the fs
is once again in a unexpected, degraded, state...)

But if existing LiveCD users are relying on this undocumented behavior,
then I think you are right and it's a bad idea to break them.

As long as my test (and presumably some fix) goes in and I don't have to
carry this patch internally for two more years, then I am happy.

Thanks,
Boris
Qu Wenruo Oct. 17, 2024, 8:47 p.m. UTC | #8
在 2024/10/17 03:44, Anand Jain 写道:
> On 16/10/24 05:38, Boris Burkov wrote:
>> If you follow the seed/sprout wiki, it suggests the following workflow:
>>
>> btrfstune -S 1 seed_dev
>> mount seed_dev mnt
>> btrfs device add sprout_dev
>> mount -o remount,rw mnt
>>
>
>
>
>> The first mount mounts the FS readonly, which results in not setting
>> BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
>> somewhat surprisingly clears the readonly bit on the sb (though the
>> mount is still practically readonly, from the users perspective...).
>> Finally, the remount checks the readonly bit on the sb against the flag
>> and sees no change, so it does not run the code intended to run on
>> ro->rw transitions, leaving BTRFS_FS_OPEN unset.
>>
>> As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
>> does no work. This results in leaking deleted snapshots until we run out
>> of space.
>>
>> I propose fixing it at the first departure from what feels reasonable:
>> when we clear the readonly bit on the sb during device add.
>>
>> A new fstest I have written reproduces the bug and confirms the fix.
>>
>> Signed-off-by: Boris Burkov <boris@bur.io>
>> ---
>> Note that this is a resend of an old unmerged fix:
>> https://lore.kernel.org/linux-
>> btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u
>> Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN
>> were also explored but not merged around that time:
>> https://lore.kernel.org/linux-btrfs/
>> cover.1654216941.git.anand.jain@oracle.com/
>>
>> I don't have a strong preference, but I would really like to see this
>> trivial bug fixed. For what it is worth, we have been carrying this
>> patch internally at Meta since I first sent it with no incident.
>> ---
>
>
> I remember fixing this before. I tested on 5.15, and the bug isn't
> there, but it’s back in 6.10, so something broke in between.
> We need to track it down.
>
> The original design (kernel 4.x and below) makes the filesystem switch
> to read-write mode after adding a sprout because:
>
>     You can’t add a device to a normal read-only filesystem
>   so with seed read-only mount is different.
>     With a seed device, adding a writable device transforms
>   it into a new read-write filesystem with a _new_ FSID and
>   fs_devices. Logically, read-write at this stage makes sense,
>   but I’m okay without it and in fact we had fixed this before,
>   but a patch somewhere seems to have broken it again.
>
>
> (Demo below. :<x> is the return code from the 'run' command at
>   https://github.com/asj/run.git)
>
>
> ----- 5.15.0-208.159.3.2.el9uek.x86_64 ----

I also tried it on upstream kernel v5.15.94, the behavior is still the
old changed to RW immediately after device add:

[adam@btrfs-vm ~]$ uname -a
Linux btrfs-vm 5.15.94-1-lts #1 SMP Wed, 15 Feb 2023 07:09:02 +0000
x86_64 GNU/Linux
[adam@btrfs-vm ~]$ sudo mkfs.btrfs  -f /dev/test/scratch1 > /dev/null
[adam@btrfs-vm ~]$ sudo btrfstune -S 1 /dev/test/scratch1
[adam@btrfs-vm ~]$ sudo mount /dev/test/scratch1  /mnt/btrfs/
mount: /mnt/btrfs: WARNING: source write-protected, mounted read-only.
[adam@btrfs-vm ~]$ sudo btrfs device add -f /dev/test/scratch2  /mnt/btrfs/
Performing full device TRIM /dev/test/scratch2 (10.00GiB) ...
[adam@btrfs-vm ~]$ sudo touch /mnt/btrfs/file
[adam@btrfs-vm ~]$ mount | grep mnt/btrfs
/dev/mapper/test-scratch2 on /mnt/btrfs type btrfs
(rw,relatime,space_cache=v2,subvolid=5,subvol=/)

So it looks like it's some extra backports causing the behavior change.

But I still strongly prefer to keep it RO.
Even if it's a different fs under the hood, it still suddenly changes
the RO/RW status of a mount point without letting the user to know.

Thanks,
Qu

> $ mkfs.btrfs -fq /dev/loop0 :0
> $ btrfstune -S1 /dev/loop0 :0
> $ mount /dev/loop0 /btrfs :0
> mount: /btrfs: WARNING: source write-protected, mounted read-only.
>
> $ cat /proc/self/mounts | grep btrfs :0
> /dev/loop0 /btrfs btrfs ro,relatime,space_cache=v2,subvolid=5,subvol=/ 0 0
>
> $ findmnt -o SOURCE,UUID /btrfs :0
> SOURCE     UUID
> /dev/loop0 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa
>
> $ btrfs fi show -m :0
> Label: none  uuid: 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa
>      Total devices 1 FS bytes used 144.00KiB
>      devid    1 size 3.00GiB used 536.00MiB path /dev/loop0
>
> $ ls /sys/fs/btrfs :0
> 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa
> features
>
> $ btrfs dev add -f /dev/loop1 /btrfs :0
>
> # After adding the device, the path and UUID are different,
> # so it’s a new filesystem. (But, as I said, I’m fine with
> # keeping it read-only and needing remount,rw.
>
> $ cat /proc/self/mounts | grep btrfs :0
> /dev/loop1 /btrfs btrfs ro,relatime,space_cache=v2,subvolid=5,subvol=/ 0 0
>
> $ findmnt -o SOURCE,UUID /btrfs :0
> SOURCE     UUID
> /dev/loop1 948cea35-18db-45da-9ec8-3d46cb5f0413
>
> $ btrfs fi show -m :0
> Label: none  uuid: 948cea35-18db-45da-9ec8-3d46cb5f0413
>      Total devices 2 FS bytes used 144.00KiB
>      devid    1 size 3.00GiB used 520.00MiB path /dev/loop0
>      devid    2 size 3.00GiB used 576.00MiB path /dev/loop1
>
>
> $ ls /sys/fs/btrfs :0
> 948cea35-18db-45da-9ec8-3d46cb5f0413
> features
> ---------
>
>
> Thanks, Anand
>
>>   fs/btrfs/volumes.c | 4 ----
>>   1 file changed, 4 deletions(-)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index dc9f54849f39..84e861dcb350 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -2841,8 +2841,6 @@ int btrfs_init_new_device(struct btrfs_fs_info
>> *fs_info, const char *device_path
>>       set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE);
>>       if (seeding_dev) {
>> -        btrfs_clear_sb_rdonly(sb);
>> -
>>           /* GFP_KERNEL allocation must not be under device_list_mutex */
>>           seed_devices = btrfs_init_sprout(fs_info);
>>           if (IS_ERR(seed_devices)) {
>> @@ -2985,8 +2983,6 @@ int btrfs_init_new_device(struct btrfs_fs_info
>> *fs_info, const char *device_path
>>       mutex_unlock(&fs_info->chunk_mutex);
>>       mutex_unlock(&fs_info->fs_devices->device_list_mutex);
>>   error_trans:
>> -    if (seeding_dev)
>> -        btrfs_set_sb_rdonly(sb);
>>       if (trans)
>>           btrfs_end_transaction(trans);
>>   error_free_zone:
>
>
Anand Jain Oct. 18, 2024, 11:54 a.m. UTC | #9
On 18/10/24 04:47, Qu Wenruo wrote:
> 
> 
> 在 2024/10/17 03:44, Anand Jain 写道:
>> On 16/10/24 05:38, Boris Burkov wrote:
>>> If you follow the seed/sprout wiki, it suggests the following workflow:
>>>
>>> btrfstune -S 1 seed_dev
>>> mount seed_dev mnt
>>> btrfs device add sprout_dev
>>> mount -o remount,rw mnt
>>>
>>
>>
>>
>>> The first mount mounts the FS readonly, which results in not setting
>>> BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
>>> somewhat surprisingly clears the readonly bit on the sb (though the
>>> mount is still practically readonly, from the users perspective...).
>>> Finally, the remount checks the readonly bit on the sb against the flag
>>> and sees no change, so it does not run the code intended to run on
>>> ro->rw transitions, leaving BTRFS_FS_OPEN unset.
>>>
>>> As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
>>> does no work. This results in leaking deleted snapshots until we run out
>>> of space.
>>>
>>> I propose fixing it at the first departure from what feels reasonable:
>>> when we clear the readonly bit on the sb during device add.
>>>
>>> A new fstest I have written reproduces the bug and confirms the fix.
>>>
>>> Signed-off-by: Boris Burkov <boris@bur.io>
>>> ---
>>> Note that this is a resend of an old unmerged fix:
>>> https://lore.kernel.org/linux-
>>> btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u
>>> Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN
>>> were also explored but not merged around that time:
>>> https://lore.kernel.org/linux-btrfs/
>>> cover.1654216941.git.anand.jain@oracle.com/
>>>
>>> I don't have a strong preference, but I would really like to see this
>>> trivial bug fixed. For what it is worth, we have been carrying this
>>> patch internally at Meta since I first sent it with no incident.
>>> ---
>>
>>
>> I remember fixing this before. I tested on 5.15, and the bug isn't
>> there, but it’s back in 6.10, so something broke in between.
>> We need to track it down.
>>
>> The original design (kernel 4.x and below) makes the filesystem switch
>> to read-write mode after adding a sprout because:
>>
>>     You can’t add a device to a normal read-only filesystem
>>   so with seed read-only mount is different.
>>     With a seed device, adding a writable device transforms
>>   it into a new read-write filesystem with a _new_ FSID and
>>   fs_devices. Logically, read-write at this stage makes sense,
>>   but I’m okay without it and in fact we had fixed this before,
>>   but a patch somewhere seems to have broken it again.
>>
>>
>> (Demo below. :<x> is the return code from the 'run' command at
>>   https://github.com/asj/run.git)
>>
>>
>> ----- 5.15.0-208.159.3.2.el9uek.x86_64 ----
> 
> I also tried it on upstream kernel v5.15.94, the behavior is still the
> old changed to RW immediately after device add:
> 
> [adam@btrfs-vm ~]$ uname -a
> Linux btrfs-vm 5.15.94-1-lts #1 SMP Wed, 15 Feb 2023 07:09:02 +0000
> x86_64 GNU/Linux
> [adam@btrfs-vm ~]$ sudo mkfs.btrfs  -f /dev/test/scratch1 > /dev/null
> [adam@btrfs-vm ~]$ sudo btrfstune -S 1 /dev/test/scratch1
> [adam@btrfs-vm ~]$ sudo mount /dev/test/scratch1  /mnt/btrfs/
> mount: /mnt/btrfs: WARNING: source write-protected, mounted read-only.
> [adam@btrfs-vm ~]$ sudo btrfs device add -f /dev/test/scratch2  /mnt/btrfs/
> Performing full device TRIM /dev/test/scratch2 (10.00GiB) ...
> [adam@btrfs-vm ~]$ sudo touch /mnt/btrfs/file
> [adam@btrfs-vm ~]$ mount | grep mnt/btrfs
> /dev/mapper/test-scratch2 on /mnt/btrfs type btrfs
> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
> 
> So it looks like it's some extra backports causing the behavior change.
> 

Actually, it is caused by util-linux, specifically the libmount.
v2.38 is good, but v2.39-rc1 is bad with the same kernel without
the fix.
Unfortunately, a bunch of libmount commits between these
versions are not bisectable.
So I have no specific commit but commits from 2b1db0951b9d
to f07412a04ca8.


> But I still strongly prefer to keep it RO.
> Even if it's a different fs under the hood, it still suddenly changes
> the RO/RW status of a mount point without letting the user to know.

Just my perspective—typically, we add an RW device to make a seed
filesystem writable. If one command can do it, that's great from
ease of use pov. But I’m fine with RO; it’s cleaner.


Thanks, Anand

> 
> Thanks,
> Qu
> 
>> $ mkfs.btrfs -fq /dev/loop0 :0
>> $ btrfstune -S1 /dev/loop0 :0
>> $ mount /dev/loop0 /btrfs :0
>> mount: /btrfs: WARNING: source write-protected, mounted read-only.
>>
>> $ cat /proc/self/mounts | grep btrfs :0
>> /dev/loop0 /btrfs btrfs ro,relatime,space_cache=v2,subvolid=5,subvol=/ 
>> 0 0
>>
>> $ findmnt -o SOURCE,UUID /btrfs :0
>> SOURCE     UUID
>> /dev/loop0 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa
>>
>> $ btrfs fi show -m :0
>> Label: none  uuid: 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa
>>      Total devices 1 FS bytes used 144.00KiB
>>      devid    1 size 3.00GiB used 536.00MiB path /dev/loop0
>>
>> $ ls /sys/fs/btrfs :0
>> 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa
>> features
>>
>> $ btrfs dev add -f /dev/loop1 /btrfs :0
>>
>> # After adding the device, the path and UUID are different,
>> # so it’s a new filesystem. (But, as I said, I’m fine with
>> # keeping it read-only and needing remount,rw.
>>
>> $ cat /proc/self/mounts | grep btrfs :0
>> /dev/loop1 /btrfs btrfs ro,relatime,space_cache=v2,subvolid=5,subvol=/ 
>> 0 0
>>
>> $ findmnt -o SOURCE,UUID /btrfs :0
>> SOURCE     UUID
>> /dev/loop1 948cea35-18db-45da-9ec8-3d46cb5f0413
>>
>> $ btrfs fi show -m :0
>> Label: none  uuid: 948cea35-18db-45da-9ec8-3d46cb5f0413
>>      Total devices 2 FS bytes used 144.00KiB
>>      devid    1 size 3.00GiB used 520.00MiB path /dev/loop0
>>      devid    2 size 3.00GiB used 576.00MiB path /dev/loop1
>>
>>
>> $ ls /sys/fs/btrfs :0
>> 948cea35-18db-45da-9ec8-3d46cb5f0413
>> features
>> ---------
>>
>>
>> Thanks, Anand
>>
>>>   fs/btrfs/volumes.c | 4 ----
>>>   1 file changed, 4 deletions(-)
>>>
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index dc9f54849f39..84e861dcb350 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -2841,8 +2841,6 @@ int btrfs_init_new_device(struct btrfs_fs_info
>>> *fs_info, const char *device_path
>>>       set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE);
>>>       if (seeding_dev) {
>>> -        btrfs_clear_sb_rdonly(sb);
>>> -
>>>           /* GFP_KERNEL allocation must not be under 
>>> device_list_mutex */
>>>           seed_devices = btrfs_init_sprout(fs_info);
>>>           if (IS_ERR(seed_devices)) {
>>> @@ -2985,8 +2983,6 @@ int btrfs_init_new_device(struct btrfs_fs_info
>>> *fs_info, const char *device_path
>>>       mutex_unlock(&fs_info->chunk_mutex);
>>>       mutex_unlock(&fs_info->fs_devices->device_list_mutex);
>>>   error_trans:
>>> -    if (seeding_dev)
>>> -        btrfs_set_sb_rdonly(sb);
>>>       if (trans)
>>>           btrfs_end_transaction(trans);
>>>   error_free_zone:
>>
>>
>
David Sterba Oct. 21, 2024, 6:56 p.m. UTC | #10
On Thu, Oct 17, 2024 at 09:41:59AM -0700, Boris Burkov wrote:
> On Thu, Oct 17, 2024 at 04:01:12PM +0200, David Sterba wrote:
> > On Tue, Oct 15, 2024 at 02:38:34PM -0700, Boris Burkov wrote:
> > > If you follow the seed/sprout wiki, it suggests the following workflow:
> > > 
> > > btrfstune -S 1 seed_dev
> > > mount seed_dev mnt
> > > btrfs device add sprout_dev
> > > mount -o remount,rw mnt
> > > 
> > > The first mount mounts the FS readonly, which results in not setting
> > > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
> > > somewhat surprisingly clears the readonly bit on the sb (though the
> > > mount is still practically readonly, from the users perspective...).
> > > Finally, the remount checks the readonly bit on the sb against the flag
> > > and sees no change, so it does not run the code intended to run on
> > > ro->rw transitions, leaving BTRFS_FS_OPEN unset.
> > > 
> > > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
> > > does no work. This results in leaking deleted snapshots until we run out
> > > of space.
> > > 
> > > I propose fixing it at the first departure from what feels reasonable:
> > > when we clear the readonly bit on the sb during device add.
> > > 
> > > A new fstest I have written reproduces the bug and confirms the fix.
> > > 
> > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > ---
> > > Note that this is a resend of an old unmerged fix:
> > > https://lore.kernel.org/linux-btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u
> > > Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN
> > > were also explored but not merged around that time:
> > > https://lore.kernel.org/linux-btrfs/cover.1654216941.git.anand.jain@oracle.com/
> > > 
> > > I don't have a strong preference, but I would really like to see this
> > > trivial bug fixed. For what it is worth, we have been carrying this
> > > patch internally at Meta since I first sent it with no incident.
> > 
> > We have an unresolved dispute about the fix and now the patch got to
> > for-next within a few days because it got two reviews but not mine.
> > The way you use it in Meta works for you but the fix is changing
> > behaviour so we can either ignore everybody else relying on the old
> > way or say that seeding is so niche that we don't care and see what we
> > can do once we get some report.
> 
> Please feel free to remove it, and I am happy to review whatever other
> fix you think is best. Sorry for rushing, I just wanted to get it done
> and out of my head so I could move on to other issues.

I'm concerned because change like that needs an announcement,
documentation changes or eventually an optional change so both use cases
are available. I haven't merged it since the last time you or somebody
posted it because I don't see a way how to fix it without fixing the bug
and not breaking the use case.

> I assume your concern is that before this fix, the filesystem is actually
> read-write after the device add, which this patch breaks?
> 
> My only argument against this is that the documentation has always said
> you needed to remount/cycle-mount after adding the sprout, so there is
> no fair reason to assume this would work. (In fact, it *doesn't*, the fs
> is once again in a unexpected, degraded, state...)
> 
> But if existing LiveCD users are relying on this undocumented behavior,
> then I think you are right and it's a bad idea to break them.

The problem here is that we don't know how the feature is used, the
documentation came much later than the feature. So I take it as that
people rely on code, not documenation, even if there's an undocumented
behaviour.
Boris Burkov Oct. 21, 2024, 7:29 p.m. UTC | #11
On Mon, Oct 21, 2024 at 08:56:51PM +0200, David Sterba wrote:
> On Thu, Oct 17, 2024 at 09:41:59AM -0700, Boris Burkov wrote:
> > On Thu, Oct 17, 2024 at 04:01:12PM +0200, David Sterba wrote:
> > > On Tue, Oct 15, 2024 at 02:38:34PM -0700, Boris Burkov wrote:
> > > > If you follow the seed/sprout wiki, it suggests the following workflow:
> > > > 
> > > > btrfstune -S 1 seed_dev
> > > > mount seed_dev mnt
> > > > btrfs device add sprout_dev
> > > > mount -o remount,rw mnt
> > > > 
> > > > The first mount mounts the FS readonly, which results in not setting
> > > > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
> > > > somewhat surprisingly clears the readonly bit on the sb (though the
> > > > mount is still practically readonly, from the users perspective...).
> > > > Finally, the remount checks the readonly bit on the sb against the flag
> > > > and sees no change, so it does not run the code intended to run on
> > > > ro->rw transitions, leaving BTRFS_FS_OPEN unset.
> > > > 
> > > > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
> > > > does no work. This results in leaking deleted snapshots until we run out
> > > > of space.
> > > > 
> > > > I propose fixing it at the first departure from what feels reasonable:
> > > > when we clear the readonly bit on the sb during device add.
> > > > 
> > > > A new fstest I have written reproduces the bug and confirms the fix.
> > > > 
> > > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > > ---
> > > > Note that this is a resend of an old unmerged fix:
> > > > https://lore.kernel.org/linux-btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u
> > > > Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN
> > > > were also explored but not merged around that time:
> > > > https://lore.kernel.org/linux-btrfs/cover.1654216941.git.anand.jain@oracle.com/
> > > > 
> > > > I don't have a strong preference, but I would really like to see this
> > > > trivial bug fixed. For what it is worth, we have been carrying this
> > > > patch internally at Meta since I first sent it with no incident.
> > > 
> > > We have an unresolved dispute about the fix and now the patch got to
> > > for-next within a few days because it got two reviews but not mine.
> > > The way you use it in Meta works for you but the fix is changing
> > > behaviour so we can either ignore everybody else relying on the old
> > > way or say that seeding is so niche that we don't care and see what we
> > > can do once we get some report.
> > 
> > Please feel free to remove it, and I am happy to review whatever other
> > fix you think is best. Sorry for rushing, I just wanted to get it done
> > and out of my head so I could move on to other issues.
> 
> I'm concerned because change like that needs an announcement,
> documentation changes or eventually an optional change so both use cases
> are available. I haven't merged it since the last time you or somebody
> posted it because I don't see a way how to fix it without fixing the bug
> and not breaking the use case.
> 

That's fair. However, I assume we agree we need some fix, somewhere.
Anyone who is really currently relying on this and not cycle mounting
after adding the device doesn't get a cleaner thread, which includes
empty block group cleanup, autorelocation, deleting snapshots, and defrag.
I think subvolume cleanup and block group cleanup are probably the worst
to lose.

So let's step back from me impulsively stuffing this in to for-next and
do the right thing together :)

Thinking of our options, in no particular order:

1a. just land the fix
It's not great if a lot of people rely on the fs "working" after device
add. Some of them might quickly find the proper steps in the docs, but
this might confuse/upset people.

1b. land the fix but properly socialize it
We can also make btrfs device add notice that it is seed sprout and add
a loud message that you need to remount rw. If this requires skipping a
merge or two while we socialize it to obvious or google-able users, that
seems reasonable.

1c. land the fix and make btrfs-progs do the remount after the device add
This is kind of weird, but does preserve the expected behavior. We could
also add some seed sprout specific flags or commands as the "main" way to
do it.

2. have the device add detect the seed sprout case and really make the
fs read-write
If we are committed to the behavior, and don't believe in a userspace
only fix, then it could be possible to make the current path which
simply clears the ro bit also invoke some or all of the remount rw
logic.

3. do nothing (remove the fix from for-next)
I keep carrying an internal patch forever, upstream stays in danger of
running an fs without a cleaner. If no one runs a long-lived fs off a live
cd for long enough to care, maybe it is ok, too. But stuff like this is
out there in google, for example:
https://blog.elastocloud.org/2022/11/btrfs-seed-devices-for-ab-system-updates.html

That's all I can think of, but I'm definitely open to other ideas!

Thanks,
Boris

> > I assume your concern is that before this fix, the filesystem is actually
> > read-write after the device add, which this patch breaks?
> > 
> > My only argument against this is that the documentation has always said
> > you needed to remount/cycle-mount after adding the sprout, so there is
> > no fair reason to assume this would work. (In fact, it *doesn't*, the fs
> > is once again in a unexpected, degraded, state...)
> > 
> > But if existing LiveCD users are relying on this undocumented behavior,
> > then I think you are right and it's a bad idea to break them.
> 
> The problem here is that we don't know how the feature is used, the
> documentation came much later than the feature. So I take it as that
> people rely on code, not documenation, even if there's an undocumented
> behaviour.
diff mbox series

Patch

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index dc9f54849f39..84e861dcb350 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2841,8 +2841,6 @@  int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE);
 
 	if (seeding_dev) {
-		btrfs_clear_sb_rdonly(sb);
-
 		/* GFP_KERNEL allocation must not be under device_list_mutex */
 		seed_devices = btrfs_init_sprout(fs_info);
 		if (IS_ERR(seed_devices)) {
@@ -2985,8 +2983,6 @@  int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	mutex_unlock(&fs_info->chunk_mutex);
 	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 error_trans:
-	if (seeding_dev)
-		btrfs_set_sb_rdonly(sb);
 	if (trans)
 		btrfs_end_transaction(trans);
 error_free_zone: