diff mbox series

[v2,2/2] btrfs: automatically remove the subvolume qgroup

Message ID 07e54de6747a5bf1e6a422cba80cbb06ba832cf4.1713519718.git.wqu@suse.com (mailing list archive)
State New, archived
Headers show
Series btrfs: qgroup: stale qgroups related impromvents | expand

Commit Message

Qu Wenruo April 19, 2024, 9:46 a.m. UTC
Currently if we fully removed a subvolume (not only unlinked, but fully
dropped its root item), its qgroup would not be removed.

Thus we have "btrfs qgroup clear-stale" to handle such 0 level qgroups.

This patch changes the behavior by automatically removing the qgroup of
a fully dropped subvolume.

There is an exception for simple quota, that btrfs_record_squota_delta()
has to handle missing qgroup case, where the target delta belongs to an
automatically deleted subvolume.
In that case since the subvolume is already gone, no need to treat it as
an error.

Link: https://bugzilla.suse.com/show_bug.cgi?id=1222847
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent-tree.c |  8 ++++++++
 fs/btrfs/qgroup.c      | 39 ++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/qgroup.h      |  2 ++
 3 files changed, 48 insertions(+), 1 deletion(-)

Comments

David Sterba April 24, 2024, 12:41 p.m. UTC | #1
On Fri, Apr 19, 2024 at 07:16:53PM +0930, Qu Wenruo wrote:
> Currently if we fully removed a subvolume (not only unlinked, but fully
> dropped its root item), its qgroup would not be removed.
> 
> Thus we have "btrfs qgroup clear-stale" to handle such 0 level qgroups.

There's also an option 'btrfs subvolume delete --delete-qgroup' that
does that and is going to be default in 6.9. With this kernel change it
would break the behaviour of the --no-delete-qgroup, which is there for
the case something depends on that.  For now I'd rather postpone
changing the kernel behaviour.
Qu Wenruo April 24, 2024, 10:19 p.m. UTC | #2
在 2024/4/24 22:11, David Sterba 写道:
> On Fri, Apr 19, 2024 at 07:16:53PM +0930, Qu Wenruo wrote:
>> Currently if we fully removed a subvolume (not only unlinked, but fully
>> dropped its root item), its qgroup would not be removed.
>>
>> Thus we have "btrfs qgroup clear-stale" to handle such 0 level qgroups.
>
> There's also an option 'btrfs subvolume delete --delete-qgroup' that
> does that and is going to be default in 6.9. With this kernel change it
> would break the behaviour of the --no-delete-qgroup, which is there for
> the case something depends on that.  For now I'd rather postpone
> changing the kernel behaviour.
>

A quick glance of the --delete-qgroup shows it won't work as expected at
all.

Firstly, the qgroup delete requires the qgroup numbers to be 0.
Meanwhile qgroup numbers can only be 0 after 1) the full subvolume has
been dropped 2) a transaction is committed to reflect the qgroup numbers.

Both situation is only handled in my patchset, thus this means for a lot
of cases it won't work at all.

Furthermore, there is the drop_subtree_threshold thing, which can mark
qgroup inconsistent and skip accounting, making the target subvolume's
qgroup numbers never fall back to 0 (until next rescan).

So I'm afraid the --delete-qgroup won't work until the 1/2 patch get
merged (allowing deleting qgroups as long as the target subvolume is gone).

Thanks,
Qu
David Sterba April 25, 2024, 12:34 p.m. UTC | #3
On Thu, Apr 25, 2024 at 07:49:12AM +0930, Qu Wenruo wrote:
> 
> 
> 在 2024/4/24 22:11, David Sterba 写道:
> > On Fri, Apr 19, 2024 at 07:16:53PM +0930, Qu Wenruo wrote:
> >> Currently if we fully removed a subvolume (not only unlinked, but fully
> >> dropped its root item), its qgroup would not be removed.
> >>
> >> Thus we have "btrfs qgroup clear-stale" to handle such 0 level qgroups.
> >
> > There's also an option 'btrfs subvolume delete --delete-qgroup' that
> > does that and is going to be default in 6.9. With this kernel change it
> > would break the behaviour of the --no-delete-qgroup, which is there for
> > the case something depends on that.  For now I'd rather postpone
> > changing the kernel behaviour.
> >
> 
> A quick glance of the --delete-qgroup shows it won't work as expected at
> all.
> 
> Firstly, the qgroup delete requires the qgroup numbers to be 0.
> Meanwhile qgroup numbers can only be 0 after 1) the full subvolume has
> been dropped 2) a transaction is committed to reflect the qgroup numbers.

The deletion option calls ioctl, so this means that 'btrfs qgroup remove'
will not delete it either?

> Both situation is only handled in my patchset, thus this means for a lot
> of cases it won't work at all.
> 
> Furthermore, there is the drop_subtree_threshold thing, which can mark
> qgroup inconsistent and skip accounting, making the target subvolume's
> qgroup numbers never fall back to 0 (until next rescan).
> 
> So I'm afraid the --delete-qgroup won't work until the 1/2 patch get
> merged (allowing deleting qgroups as long as the target subvolume is gone).

Ok, so for emulation of the complete removal in userspace it's

btrfs subvolume delete 123
btrfs subvolume sync 123
btrfs qgroup remove 0/123

but this needs to wait until the sync is finished and that is not
expected for the subvolume delete command. It needs to be fixed but now
I'm not sure this can be default in 6.9 as planned.
Qu Wenruo April 25, 2024, 9:51 p.m. UTC | #4
在 2024/4/25 22:04, David Sterba 写道:
> On Thu, Apr 25, 2024 at 07:49:12AM +0930, Qu Wenruo wrote:
>>
>>
>> 在 2024/4/24 22:11, David Sterba 写道:
>>> On Fri, Apr 19, 2024 at 07:16:53PM +0930, Qu Wenruo wrote:
>>>> Currently if we fully removed a subvolume (not only unlinked, but fully
>>>> dropped its root item), its qgroup would not be removed.
>>>>
>>>> Thus we have "btrfs qgroup clear-stale" to handle such 0 level qgroups.
>>>
>>> There's also an option 'btrfs subvolume delete --delete-qgroup' that
>>> does that and is going to be default in 6.9. With this kernel change it
>>> would break the behaviour of the --no-delete-qgroup, which is there for
>>> the case something depends on that.  For now I'd rather postpone
>>> changing the kernel behaviour.
>>>
>>
>> A quick glance of the --delete-qgroup shows it won't work as expected at
>> all.
>>
>> Firstly, the qgroup delete requires the qgroup numbers to be 0.
>> Meanwhile qgroup numbers can only be 0 after 1) the full subvolume has
>> been dropped 2) a transaction is committed to reflect the qgroup numbers.
>
> The deletion option calls ioctl, so this means that 'btrfs qgroup remove'
> will not delete it either?

Nope, at least if the subvolume is not cleaned up immediately.
>
>> Both situation is only handled in my patchset, thus this means for a lot
>> of cases it won't work at all.
>>
>> Furthermore, there is the drop_subtree_threshold thing, which can mark
>> qgroup inconsistent and skip accounting, making the target subvolume's
>> qgroup numbers never fall back to 0 (until next rescan).
>>
>> So I'm afraid the --delete-qgroup won't work until the 1/2 patch get
>> merged (allowing deleting qgroups as long as the target subvolume is gone).
>
> Ok, so for emulation of the complete removal in userspace it's
>
> btrfs subvolume delete 123
> btrfs subvolume sync 123
> btrfs qgroup remove 0/123
>
> but this needs to wait until the sync is finished and that is not
> expected for the subvolume delete command.

That's the problem, and why doing it in user space has it limits.

Furthermore, with drop_subtree_threshold or other qgroup operations
marking the qgroup inconsistent, you can not delete that qgroup at all,
until the next rescan.

> It needs to be fixed but now
> I'm not sure this can be default in 6.9 as planned.

I'd say, you should not implement this feature without really
understanding the challenges in the first place.

And that's why I really prefer you send out non-trivial btrfs-progs for
review, other than pushing them directly into github repo.

Thanks,
Qu
Boris Burkov April 29, 2024, 12:57 p.m. UTC | #5
On Fri, Apr 19, 2024 at 07:16:53PM +0930, Qu Wenruo wrote:
> Currently if we fully removed a subvolume (not only unlinked, but fully
> dropped its root item), its qgroup would not be removed.
> 
> Thus we have "btrfs qgroup clear-stale" to handle such 0 level qgroups.
> 
> This patch changes the behavior by automatically removing the qgroup of
> a fully dropped subvolume.
> 
> There is an exception for simple quota, that btrfs_record_squota_delta()
> has to handle missing qgroup case, where the target delta belongs to an
> automatically deleted subvolume.
> In that case since the subvolume is already gone, no need to treat it as
> an error.
> 
> Link: https://bugzilla.suse.com/show_bug.cgi?id=1222847
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/extent-tree.c |  8 ++++++++
>  fs/btrfs/qgroup.c      | 39 ++++++++++++++++++++++++++++++++++++++-
>  fs/btrfs/qgroup.h      |  2 ++
>  3 files changed, 48 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 023920d0d971..1e2caa234146 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -5834,6 +5834,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
>  	struct btrfs_root_item *root_item = &root->root_item;
>  	struct walk_control *wc;
>  	struct btrfs_key key;
> +	const u64 rootid = btrfs_root_id(root);
>  	int err = 0;
>  	int ret;
>  	int level;
> @@ -6064,6 +6065,13 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
>  	kfree(wc);
>  	btrfs_free_path(path);
>  out:
> +	if (!err && root_dropped) {
> +		ret = btrfs_qgroup_cleanup_dropped_subvolume(fs_info, rootid);
> +		if (ret < 0)
> +			btrfs_warn_rl(fs_info,
> +				      "failed to cleanup qgroup 0/%llu: %d",
> +				      rootid, ret);
> +	}
>  	/*
>  	 * We were an unfinished drop root, check to see if there are any
>  	 * pending, and if not clear and wake up any waiters.
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 2ea16a07a7d4..9aeb740388ab 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -1859,6 +1859,39 @@ int btrfs_remove_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid)
>  	return ret;
>  }
>  
> +int btrfs_qgroup_cleanup_dropped_subvolume(struct btrfs_fs_info *fs_info,
> +					   u64 subvolid)
> +{
> +	struct btrfs_trans_handle *trans;
> +	int ret;
> +
> +	if (!is_fstree(subvolid) || !btrfs_qgroup_enabled(fs_info) ||
> +	    !fs_info->quota_root)
> +		return 0;
> +
> +	/*
> +	 * If our qgroup is consistent, commit current transaction to make sure
> +	 * all the rfer/excl numbers get updated to 0 before deleting.
> +	 */
> +	if (!(fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT)) {
> +		trans = btrfs_start_transaction(fs_info->quota_root, 0);
> +		if (IS_ERR(trans))
> +			return PTR_ERR(trans);
> +
> +		ret = btrfs_commit_transaction(trans);
> +		if (ret < 0)
> +			return ret;
> +	}
> +
> +	/* Start new trans to delete the qgroup info and limit items. */
> +	trans = btrfs_start_transaction(fs_info->quota_root, 2);
> +	if (IS_ERR(trans))
> +		return PTR_ERR(trans);
> +	ret = btrfs_remove_qgroup(trans, subvolid);
> +	btrfs_end_transaction(trans);
> +	return ret;
> +}
> +
>  int btrfs_limit_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid,
>  		       struct btrfs_qgroup_limit *limit)
>  {
> @@ -4877,7 +4910,11 @@ int btrfs_record_squota_delta(struct btrfs_fs_info *fs_info,
>  	spin_lock(&fs_info->qgroup_lock);
>  	qgroup = find_qgroup_rb(fs_info, root);
>  	if (!qgroup) {
> -		ret = -ENOENT;
> +		/*
> +		 * The qgroup can be auto deleted by subvolume deleting.
> +		 * In that case do not consider it an error.
> +		 */
> +		ret = 0;

If we call record_squota_delta, that means we are adding an extent to
the tree whose owner ref will point at this subvol. Which means that we
are entering into a case where our on disk accounting disagrees between
qgroup and extent tree. This should fail the squota fsck logic, if you
really trigger it.

How were you hitting this? In some mundane way where the extent actually
gets dropped? I worry that if the extent also picks up a reference in
the same transaction in a way that the owner is the dropped subvol, we
could get into inconsistency. Ideally, the checks on rsv/usage in drop
also prevent that..

>  		goto out;
>  	}
>  
> diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
> index 706640be0ec2..3f93856a02e1 100644
> --- a/fs/btrfs/qgroup.h
> +++ b/fs/btrfs/qgroup.h
> @@ -327,6 +327,8 @@ int btrfs_del_qgroup_relation(struct btrfs_trans_handle *trans, u64 src,
>  			      u64 dst);
>  int btrfs_create_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid);
>  int btrfs_remove_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid);
> +int btrfs_qgroup_cleanup_dropped_subvolume(struct btrfs_fs_info *fs_info,
> +					   u64 subvolid);
>  int btrfs_limit_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid,
>  		       struct btrfs_qgroup_limit *limit);
>  int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info);
> -- 
> 2.44.0
>
Boris Burkov April 29, 2024, 1:13 p.m. UTC | #6
On Fri, Apr 26, 2024 at 07:21:08AM +0930, Qu Wenruo wrote:
> 
> 
> 在 2024/4/25 22:04, David Sterba 写道:
> > On Thu, Apr 25, 2024 at 07:49:12AM +0930, Qu Wenruo wrote:
> > > 
> > > 
> > > 在 2024/4/24 22:11, David Sterba 写道:
> > > > On Fri, Apr 19, 2024 at 07:16:53PM +0930, Qu Wenruo wrote:
> > > > > Currently if we fully removed a subvolume (not only unlinked, but fully
> > > > > dropped its root item), its qgroup would not be removed.
> > > > > 
> > > > > Thus we have "btrfs qgroup clear-stale" to handle such 0 level qgroups.
> > > > 
> > > > There's also an option 'btrfs subvolume delete --delete-qgroup' that
> > > > does that and is going to be default in 6.9. With this kernel change it
> > > > would break the behaviour of the --no-delete-qgroup, which is there for
> > > > the case something depends on that.  For now I'd rather postpone
> > > > changing the kernel behaviour.
> > > > 
> > > 
> > > A quick glance of the --delete-qgroup shows it won't work as expected at
> > > all.
> > > 
> > > Firstly, the qgroup delete requires the qgroup numbers to be 0.
> > > Meanwhile qgroup numbers can only be 0 after 1) the full subvolume has
> > > been dropped 2) a transaction is committed to reflect the qgroup numbers.
> > 
> > The deletion option calls ioctl, so this means that 'btrfs qgroup remove'
> > will not delete it either?
> 
> Nope, at least if the subvolume is not cleaned up immediately.
> > 
> > > Both situation is only handled in my patchset, thus this means for a lot
> > > of cases it won't work at all.
> > > 
> > > Furthermore, there is the drop_subtree_threshold thing, which can mark
> > > qgroup inconsistent and skip accounting, making the target subvolume's
> > > qgroup numbers never fall back to 0 (until next rescan).
> > > 
> > > So I'm afraid the --delete-qgroup won't work until the 1/2 patch get
> > > merged (allowing deleting qgroups as long as the target subvolume is gone).
> > 
> > Ok, so for emulation of the complete removal in userspace it's
> > 
> > btrfs subvolume delete 123
> > btrfs subvolume sync 123
> > btrfs qgroup remove 0/123
> > 
> > but this needs to wait until the sync is finished and that is not
> > expected for the subvolume delete command.
> 
> That's the problem, and why doing it in user space has it limits.
> 
> Furthermore, with drop_subtree_threshold or other qgroup operations
> marking the qgroup inconsistent, you can not delete that qgroup at all,
> until the next rescan.
> 
> > It needs to be fixed but now
> > I'm not sure this can be default in 6.9 as planned.
> 
> I'd say, you should not implement this feature without really
> understanding the challenges in the first place.
> 
> And that's why I really prefer you send out non-trivial btrfs-progs for
> review, other than pushing them directly into github repo.
> 
> Thanks,
> Qu

I support the auto deletion in the kernel as you propose, I think it
just makes sense. Who wants stale, empty qgroups around that aren't
attached to any subvol? I suppose that with the drop_thresh thing, it is
possible some parent qgroup still reflects the usage until the next full
scan?

Thinking out loud -- for regular qgroups, we could avoid this all if we
do the reaping when usage goes to 0 and there is no subvol. So remove
the qgroup as a consequence of the rescan, not the subvol delete. I
imagine this is quite a bit messier, though :(

We could also just not auto-reap if that condition occurs (inconsistent
qg with a parent), but I think that may be surprising for the same
reasons that got you working on this in the first place...

Do we know of an explicit need to support --no-delete-qgroup? It feels
like it is perfectly normal for us to improve the default behavior of
the kernel or userspace tools without supporting the old behavior as a
flag forever (without a user).

Put another way, I think it would be perfectly reasonable to term the
stale qgroups a leaked memory resource and this patch a bug fix, if we
are willing to get overly philosophical about it. We don't carry around
eternal flags for bug fixes, unless it's some rather exceptional case.

If we are on the hook for supporing that flag because we already added
it to progs and don't want to deprecate it, then maybe we can think of
something compatible with default auto-reap.

Thanks,
Boris
David Sterba April 29, 2024, 4:31 p.m. UTC | #7
On Mon, Apr 29, 2024 at 06:13:33AM -0700, Boris Burkov wrote:
> I support the auto deletion in the kernel as you propose, I think it
> just makes sense. Who wants stale, empty qgroups around that aren't
> attached to any subvol? I suppose that with the drop_thresh thing, it is
> possible some parent qgroup still reflects the usage until the next full
> scan?

The stale qgroups have been out for a long time so removing them after
subvolume deletion is changing default behaviour, this always breaks
somebody's scripts or tools.

> Thinking out loud -- for regular qgroups, we could avoid this all if we
> do the reaping when usage goes to 0 and there is no subvol. So remove
> the qgroup as a consequence of the rescan, not the subvol delete. I
> imagine this is quite a bit messier, though :(
> 
> We could also just not auto-reap if that condition occurs (inconsistent
> qg with a parent), but I think that may be surprising for the same
> reasons that got you working on this in the first place...
> 
> Do we know of an explicit need to support --no-delete-qgroup? It feels
> like it is perfectly normal for us to improve the default behavior of
> the kernel or userspace tools without supporting the old behavior as a
> flag forever (without a user).

$ id=$(btrfs inspect rootid subvol)
$ btrfs subvolume delete subvol
$ btrfs qgroup remove 0/$id 1/1 .                   <---- fails
$ btrfs qgroup destroy 0/$id .                      <---- fails

> Put another way, I think it would be perfectly reasonable to term the
> stale qgroups a leaked memory resource and this patch a bug fix, if we
> are willing to get overly philosophical about it. We don't carry around
> eternal flags for bug fixes, unless it's some rather exceptional case.

The command line option does not do what I expected, if somebody would
have to update the scripts to add it then we can do the kernel
auto-remove and the document it. Eventually we can translate the -ENOENT
error code to be ignored.

> If we are on the hook for supporing that flag because we already added
> it to progs and don't want to deprecate it, then maybe we can think of
> something compatible with default auto-reap.
David Sterba April 29, 2024, 4:36 p.m. UTC | #8
On Fri, Apr 26, 2024 at 07:21:08AM +0930, Qu Wenruo wrote:
> > btrfs subvolume delete 123
> > btrfs subvolume sync 123
> > btrfs qgroup remove 0/123
> >
> > but this needs to wait until the sync is finished and that is not
> > expected for the subvolume delete command.
> 
> That's the problem, and why doing it in user space has it limits.
> 
> Furthermore, with drop_subtree_threshold or other qgroup operations
> marking the qgroup inconsistent, you can not delete that qgroup at all,
> until the next rescan.

Ok so that makes it more complicated and better solved in kernel, with
the compatibility issues mentioned in the other mail.

> > It needs to be fixed but now
> > I'm not sure this can be default in 6.9 as planned.
> 
> I'd say, you should not implement this feature without really
> understanding the challenges in the first place.
> 
> And that's why I really prefer you send out non-trivial btrfs-progs for
> review, other than pushing them directly into github repo.

As discussed on slack, the development happens in git, for btrfs-progs
the mails are overhead that does not pay off.
Qu Wenruo April 29, 2024, 10:05 p.m. UTC | #9
在 2024/4/30 02:01, David Sterba 写道:
> On Mon, Apr 29, 2024 at 06:13:33AM -0700, Boris Burkov wrote:
>> I support the auto deletion in the kernel as you propose, I think it
>> just makes sense. Who wants stale, empty qgroups around that aren't
>> attached to any subvol? I suppose that with the drop_thresh thing, it is
>> possible some parent qgroup still reflects the usage until the next full
>> scan?
>
> The stale qgroups have been out for a long time so removing them after
> subvolume deletion is changing default behaviour, this always breaks
> somebody's scripts or tools.

If needed I can introduce a compat bit (the first one), to tell the
behavior difference.

And if we go the compat bit way, I believe it can be the first example
on how to do a kernel behavior change correctly without breaking any
user space assumption.

Thanks,
Qu
>
>> Thinking out loud -- for regular qgroups, we could avoid this all if we
>> do the reaping when usage goes to 0 and there is no subvol. So remove
>> the qgroup as a consequence of the rescan, not the subvol delete. I
>> imagine this is quite a bit messier, though :(
>>
>> We could also just not auto-reap if that condition occurs (inconsistent
>> qg with a parent), but I think that may be surprising for the same
>> reasons that got you working on this in the first place...
>>
>> Do we know of an explicit need to support --no-delete-qgroup? It feels
>> like it is perfectly normal for us to improve the default behavior of
>> the kernel or userspace tools without supporting the old behavior as a
>> flag forever (without a user).
>
> $ id=$(btrfs inspect rootid subvol)
> $ btrfs subvolume delete subvol
> $ btrfs qgroup remove 0/$id 1/1 .                   <---- fails
> $ btrfs qgroup destroy 0/$id .                      <---- fails
>
>> Put another way, I think it would be perfectly reasonable to term the
>> stale qgroups a leaked memory resource and this patch a bug fix, if we
>> are willing to get overly philosophical about it. We don't carry around
>> eternal flags for bug fixes, unless it's some rather exceptional case.
>
> The command line option does not do what I expected, if somebody would
> have to update the scripts to add it then we can do the kernel
> auto-remove and the document it. Eventually we can translate the -ENOENT
> error code to be ignored.
>
>> If we are on the hook for supporing that flag because we already added
>> it to progs and don't want to deprecate it, then maybe we can think of
>> something compatible with default auto-reap.
Qu Wenruo April 29, 2024, 10:49 p.m. UTC | #10
在 2024/4/29 22:43, Boris Burkov 写道:
> On Fri, Apr 26, 2024 at 07:21:08AM +0930, Qu Wenruo wrote:
[...]
>> I'd say, you should not implement this feature without really
>> understanding the challenges in the first place.
>>
>> And that's why I really prefer you send out non-trivial btrfs-progs for
>> review, other than pushing them directly into github repo.
>>
>> Thanks,
>> Qu
>
> I support the auto deletion in the kernel as you propose, I think it
> just makes sense. Who wants stale, empty qgroups around that aren't
> attached to any subvol? I suppose that with the drop_thresh thing, it is
> possible some parent qgroup still reflects the usage until the next full
> scan?

Yep, that's the problem.
Although I can also do the extra dropping for regular qgroups, but since
it's already inconsistent, modifying the numbers won't make much
difference anyway.

>
> Thinking out loud -- for regular qgroups, we could avoid this all if we
> do the reaping when usage goes to 0 and there is no subvol. So remove
> the qgroup as a consequence of the rescan, not the subvol delete. I
> imagine this is quite a bit messier, though :(

This may be a little more complex, we need to do the extra check at
qgroup accounting time.
And IIRC there may be some corner case that numbers of a qgroup dropped
to 0 temporarily, but later extents accounting would increase it back.

Another concern is related to rescan.
Rescan would mark all qgroup to have 0 numbers first, then let rescan to
refill them all.

This means for a fully dropped subvolume in INCONSISTENT mode, there is
no other timing other than rescan setting its numbers to 0.
So dropping the qgroup when its number goes back to 0 won't happen for it.

>
> We could also just not auto-reap if that condition occurs (inconsistent
> qg with a parent), but I think that may be surprising for the same
> reasons that got you working on this in the first place...
>
> Do we know of an explicit need to support --no-delete-qgroup? It feels
> like it is perfectly normal for us to improve the default behavior of
> the kernel or userspace tools without supporting the old behavior as a
> flag forever (without a user).

I do not think any one would intentionally prefer a stale one.

But considering if squota requires a stale qgroup to keep the accounting
correct, the difference between regular qgroup and squota may grow
larger and larger in the future.

I'll need to re-consider the qgroup drop condition to make it only
possible to delete really stale qgroups (all zero and no subvolume for
both regular qgroup/squota, but for inconsistent regular qgroup,
allowing dropping non-zero qgroups).

>
> Put another way, I think it would be perfectly reasonable to term the
> stale qgroups a leaked memory resource and this patch a bug fix, if we
> are willing to get overly philosophical about it. We don't carry around
> eternal flags for bug fixes, unless it's some rather exceptional case.
>
> If we are on the hook for supporing that flag because we already added
> it to progs and don't want to deprecate it, then maybe we can think of
> something compatible with default auto-reap.

Like a compat bit to represent if we do the auto-reap.
This would definitely make older script happy.

But I think it's a little overkilled.

Thanks,
Qu
>
> Thanks,
> Boris
>
David Sterba April 30, 2024, 10:59 a.m. UTC | #11
On Tue, Apr 30, 2024 at 07:35:11AM +0930, Qu Wenruo wrote:
> 
> 
> 在 2024/4/30 02:01, David Sterba 写道:
> > On Mon, Apr 29, 2024 at 06:13:33AM -0700, Boris Burkov wrote:
> >> I support the auto deletion in the kernel as you propose, I think it
> >> just makes sense. Who wants stale, empty qgroups around that aren't
> >> attached to any subvol? I suppose that with the drop_thresh thing, it is
> >> possible some parent qgroup still reflects the usage until the next full
> >> scan?
> >
> > The stale qgroups have been out for a long time so removing them after
> > subvolume deletion is changing default behaviour, this always breaks
> > somebody's scripts or tools.
> 
> If needed I can introduce a compat bit (the first one), to tell the
> behavior difference.
> 
> And if we go the compat bit way, I believe it can be the first example
> on how to do a kernel behavior change correctly without breaking any
> user space assumption.

I don't see how a compat bit would work here, we use them for feature
compatibility and for general access to data (full or read-only). What
we do with individual behavioral changes are sysfs files. They're
detectable by scripts and can be also used for configuration. In this
case enabling/disabling autoclean of the qgroups.
Qu Wenruo April 30, 2024, 10:05 p.m. UTC | #12
在 2024/4/30 20:29, David Sterba 写道:
> On Tue, Apr 30, 2024 at 07:35:11AM +0930, Qu Wenruo wrote:
>>
>>
>> 在 2024/4/30 02:01, David Sterba 写道:
>>> On Mon, Apr 29, 2024 at 06:13:33AM -0700, Boris Burkov wrote:
>>>> I support the auto deletion in the kernel as you propose, I think it
>>>> just makes sense. Who wants stale, empty qgroups around that aren't
>>>> attached to any subvol? I suppose that with the drop_thresh thing, it is
>>>> possible some parent qgroup still reflects the usage until the next full
>>>> scan?
>>>
>>> The stale qgroups have been out for a long time so removing them after
>>> subvolume deletion is changing default behaviour, this always breaks
>>> somebody's scripts or tools.
>>
>> If needed I can introduce a compat bit (the first one), to tell the
>> behavior difference.
>>
>> And if we go the compat bit way, I believe it can be the first example
>> on how to do a kernel behavior change correctly without breaking any
>> user space assumption.
>
> I don't see how a compat bit would work here, we use them for feature
> compatibility and for general access to data (full or read-only). What
> we do with individual behavioral changes are sysfs files. They're
> detectable by scripts and can be also used for configuration. In this
> case enabling/disabling autoclean of the qgroups.
>

I mean the compat bit, which is fully empty now.

The new bit would be something like BTRFS_QGROUP_AUTO_REMOVAL, with that
set, btrfs would auto remove any stale qgroups (for regular qgroups though).

Without that, it would be as usual (no auto removal).

Since this doesn't cause any on-disk change, it does not needs compat-ro
nor incompat.

Thanks,
Qu
Boris Burkov April 30, 2024, 10:18 p.m. UTC | #13
On Wed, May 01, 2024 at 07:35:09AM +0930, Qu Wenruo wrote:
> 
> 
> 在 2024/4/30 20:29, David Sterba 写道:
> > On Tue, Apr 30, 2024 at 07:35:11AM +0930, Qu Wenruo wrote:
> > > 
> > > 
> > > 在 2024/4/30 02:01, David Sterba 写道:
> > > > On Mon, Apr 29, 2024 at 06:13:33AM -0700, Boris Burkov wrote:
> > > > > I support the auto deletion in the kernel as you propose, I think it
> > > > > just makes sense. Who wants stale, empty qgroups around that aren't
> > > > > attached to any subvol? I suppose that with the drop_thresh thing, it is
> > > > > possible some parent qgroup still reflects the usage until the next full
> > > > > scan?
> > > > 
> > > > The stale qgroups have been out for a long time so removing them after
> > > > subvolume deletion is changing default behaviour, this always breaks
> > > > somebody's scripts or tools.
> > > 
> > > If needed I can introduce a compat bit (the first one), to tell the
> > > behavior difference.
> > > 
> > > And if we go the compat bit way, I believe it can be the first example
> > > on how to do a kernel behavior change correctly without breaking any
> > > user space assumption.
> > 
> > I don't see how a compat bit would work here, we use them for feature
> > compatibility and for general access to data (full or read-only). What
> > we do with individual behavioral changes are sysfs files. They're
> > detectable by scripts and can be also used for configuration. In this
> > case enabling/disabling autoclean of the qgroups.

This was my initial thought too, but your compat bit idea is interesting
since it persists? I vote sysfs since it has good
infrastructure/momentum already for similar config.

> > 
> 
> I mean the compat bit, which is fully empty now.
> 
> The new bit would be something like BTRFS_QGROUP_AUTO_REMOVAL, with that
> set, btrfs would auto remove any stale qgroups (for regular qgroups though).
> 
> Without that, it would be as usual (no auto removal).
> 
> Since this doesn't cause any on-disk change, it does not needs compat-ro
> nor incompat.
> 
> Thanks,
> Qu
Qu Wenruo April 30, 2024, 10:27 p.m. UTC | #14
在 2024/5/1 07:48, Boris Burkov 写道:
> On Wed, May 01, 2024 at 07:35:09AM +0930, Qu Wenruo wrote:
[...]
>>>
>>> I don't see how a compat bit would work here, we use them for feature
>>> compatibility and for general access to data (full or read-only). What
>>> we do with individual behavioral changes are sysfs files. They're
>>> detectable by scripts and can be also used for configuration. In this
>>> case enabling/disabling autoclean of the qgroups.
>
> This was my initial thought too, but your compat bit idea is interesting
> since it persists? I vote sysfs since it has good
> infrastructure/momentum already for similar config.

Sysfs is another way which we are already utilizing for qgroups, like
drop_subtree_threshold.

The problem is as mentioned already, it's not persistent, thus it needs
a user space daemon to set it for every fs after mount.

I'm totally fine to go sysfs for now, but I really hope to a persistent
solution.
Maybe a dedicated config tree?

Thanks,
Qu

>
>>>
>>
>> I mean the compat bit, which is fully empty now.
>>
>> The new bit would be something like BTRFS_QGROUP_AUTO_REMOVAL, with that
>> set, btrfs would auto remove any stale qgroups (for regular qgroups though).
>>
>> Without that, it would be as usual (no auto removal).
>>
>> Since this doesn't cause any on-disk change, it does not needs compat-ro
>> nor incompat.
>>
>> Thanks,
>> Qu
David Sterba May 2, 2024, 3 p.m. UTC | #15
On Wed, May 01, 2024 at 07:35:09AM +0930, Qu Wenruo wrote:
> 
> 
> 在 2024/4/30 20:29, David Sterba 写道:
> > On Tue, Apr 30, 2024 at 07:35:11AM +0930, Qu Wenruo wrote:
> >>
> >>
> >> 在 2024/4/30 02:01, David Sterba 写道:
> >>> On Mon, Apr 29, 2024 at 06:13:33AM -0700, Boris Burkov wrote:
> >>>> I support the auto deletion in the kernel as you propose, I think it
> >>>> just makes sense. Who wants stale, empty qgroups around that aren't
> >>>> attached to any subvol? I suppose that with the drop_thresh thing, it is
> >>>> possible some parent qgroup still reflects the usage until the next full
> >>>> scan?
> >>>
> >>> The stale qgroups have been out for a long time so removing them after
> >>> subvolume deletion is changing default behaviour, this always breaks
> >>> somebody's scripts or tools.
> >>
> >> If needed I can introduce a compat bit (the first one), to tell the
> >> behavior difference.
> >>
> >> And if we go the compat bit way, I believe it can be the first example
> >> on how to do a kernel behavior change correctly without breaking any
> >> user space assumption.
> >
> > I don't see how a compat bit would work here, we use them for feature
> > compatibility and for general access to data (full or read-only). What
> > we do with individual behavioral changes are sysfs files. They're
> > detectable by scripts and can be also used for configuration. In this
> > case enabling/disabling autoclean of the qgroups.
> 
> I mean the compat bit, which is fully empty now.

Which compat bit and in which structure? What I mean is the super block
incompat/compat bits but what you described below does not match it.

> The new bit would be something like BTRFS_QGROUP_AUTO_REMOVAL, with that
> set, btrfs would auto remove any stale qgroups (for regular qgroups though).
> 
> Without that, it would be as usual (no auto removal).
David Sterba May 2, 2024, 3:03 p.m. UTC | #16
On Wed, May 01, 2024 at 07:57:08AM +0930, Qu Wenruo wrote:
> 
> 
> 在 2024/5/1 07:48, Boris Burkov 写道:
> > On Wed, May 01, 2024 at 07:35:09AM +0930, Qu Wenruo wrote:
> [...]
> >>>
> >>> I don't see how a compat bit would work here, we use them for feature
> >>> compatibility and for general access to data (full or read-only). What
> >>> we do with individual behavioral changes are sysfs files. They're
> >>> detectable by scripts and can be also used for configuration. In this
> >>> case enabling/disabling autoclean of the qgroups.
> >
> > This was my initial thought too, but your compat bit idea is interesting
> > since it persists? I vote sysfs since it has good
> > infrastructure/momentum already for similar config.
> 
> Sysfs is another way which we are already utilizing for qgroups, like
> drop_subtree_threshold.
> 
> The problem is as mentioned already, it's not persistent, thus it needs
> a user space daemon to set it for every fs after mount.

My idea of using sysfs is to export the information that the
autocleaning feature is present and if we make it on by default then
there's no need for additional step to enable it. The feedback about
that was that it should have been default so we're going to make that
change, but with sysfs export also provide a fallback to disable it in
case it breaks things for somebody.

> I'm totally fine to go sysfs for now, but I really hope to a persistent
> solution.
> Maybe a dedicated config tree?

No, we already have a way to store data in the trees or in the
properties so no new tree.
Qu Wenruo May 2, 2024, 9:27 p.m. UTC | #17
在 2024/5/3 00:30, David Sterba 写道:
> On Wed, May 01, 2024 at 07:35:09AM +0930, Qu Wenruo wrote:
>>
>>
>> 在 2024/4/30 20:29, David Sterba 写道:
>>> On Tue, Apr 30, 2024 at 07:35:11AM +0930, Qu Wenruo wrote:
>>>>
>>>>
>>>> 在 2024/4/30 02:01, David Sterba 写道:
>>>>> On Mon, Apr 29, 2024 at 06:13:33AM -0700, Boris Burkov wrote:
>>>>>> I support the auto deletion in the kernel as you propose, I think it
>>>>>> just makes sense. Who wants stale, empty qgroups around that aren't
>>>>>> attached to any subvol? I suppose that with the drop_thresh thing, it is
>>>>>> possible some parent qgroup still reflects the usage until the next full
>>>>>> scan?
>>>>>
>>>>> The stale qgroups have been out for a long time so removing them after
>>>>> subvolume deletion is changing default behaviour, this always breaks
>>>>> somebody's scripts or tools.
>>>>
>>>> If needed I can introduce a compat bit (the first one), to tell the
>>>> behavior difference.
>>>>
>>>> And if we go the compat bit way, I believe it can be the first example
>>>> on how to do a kernel behavior change correctly without breaking any
>>>> user space assumption.
>>>
>>> I don't see how a compat bit would work here, we use them for feature
>>> compatibility and for general access to data (full or read-only). What
>>> we do with individual behavioral changes are sysfs files. They're
>>> detectable by scripts and can be also used for configuration. In this
>>> case enabling/disabling autoclean of the qgroups.
>>
>> I mean the compat bit, which is fully empty now.
>
> Which compat bit and in which structure? What I mean is the super block
> incompat/compat bits but what you described below does not match it.

Superblock compat bit exactly.

I didn't see why my following description doesn't match it.

We have 3 compat bits:

- incompat
   Huge on-disk format change that older kernel can not understand and
   affects older kernels' ability to read files.
   This is the most abused bits from the past.
   A lot of features like skinny metadata should go compat_ro.

- compat_ro
   On-disk format changes, but older kernels should still be able to read
   the fs/csum tree etc.

- compat
   Not utilized for now. Older kernel can still do everything without any
   problem.
   Thus this should be the best candidate for runtime behavior changes
   that doesn't affect on-disk format at all.

Thanks,
Qu
>
>> The new bit would be something like BTRFS_QGROUP_AUTO_REMOVAL, with that
>> set, btrfs would auto remove any stale qgroups (for regular qgroups though).
>>
>> Without that, it would be as usual (no auto removal).
Qu Wenruo May 2, 2024, 9:29 p.m. UTC | #18
在 2024/5/3 00:33, David Sterba 写道:
> On Wed, May 01, 2024 at 07:57:08AM +0930, Qu Wenruo wrote:
>>
>>
>> 在 2024/5/1 07:48, Boris Burkov 写道:
>>> On Wed, May 01, 2024 at 07:35:09AM +0930, Qu Wenruo wrote:
>> [...]
>>>>>
>>>>> I don't see how a compat bit would work here, we use them for feature
>>>>> compatibility and for general access to data (full or read-only). What
>>>>> we do with individual behavioral changes are sysfs files. They're
>>>>> detectable by scripts and can be also used for configuration. In this
>>>>> case enabling/disabling autoclean of the qgroups.
>>>
>>> This was my initial thought too, but your compat bit idea is interesting
>>> since it persists? I vote sysfs since it has good
>>> infrastructure/momentum already for similar config.
>>
>> Sysfs is another way which we are already utilizing for qgroups, like
>> drop_subtree_threshold.
>>
>> The problem is as mentioned already, it's not persistent, thus it needs
>> a user space daemon to set it for every fs after mount.
>
> My idea of using sysfs is to export the information that the
> autocleaning feature is present and if we make it on by default then
> there's no need for additional step to enable it. The feedback about
> that was that it should have been default so we're going to make that
> change, but with sysfs export also provide a fallback to disable it in
> case it breaks things for somebody.
>
>> I'm totally fine to go sysfs for now, but I really hope to a persistent
>> solution.
>> Maybe a dedicated config tree?
>
> No, we already have a way to store data in the trees or in the
> properties so no new tree.

That means a on-disk format change.
IIRC everytime we introduce a new TEMP objectid, it should at least be
compat or compat_ro bit change.

Or older kernel won't understand nor follow the new TEMP key.

Thanks,
Qu
David Sterba May 3, 2024, 12:46 p.m. UTC | #19
On Fri, May 03, 2024 at 06:59:03AM +0930, Qu Wenruo wrote:
> >> The problem is as mentioned already, it's not persistent, thus it needs
> >> a user space daemon to set it for every fs after mount.
> >
> > My idea of using sysfs is to export the information that the
> > autocleaning feature is present and if we make it on by default then
> > there's no need for additional step to enable it. The feedback about
> > that was that it should have been default so we're going to make that
> > change, but with sysfs export also provide a fallback to disable it in
> > case it breaks things for somebody.
> >
> >> I'm totally fine to go sysfs for now, but I really hope to a persistent
> >> solution.
> >> Maybe a dedicated config tree?
> >
> > No, we already have a way to store data in the trees or in the
> > properties so no new tree.
> 
> That means a on-disk format change.
> IIRC everytime we introduce a new TEMP objectid, it should at least be
> compat or compat_ro bit change.
> 
> Or older kernel won't understand nor follow the new TEMP key.

This is not necessarily a problem, if older kernel does not understand
the key it'll skip it. The search treats the keys as numbers. If there's
something more associated with the additional data that would prevent
mount then a compat bit would make sense.

The compat_ro does not make sense here, the same filesystem can be
mounted and data read or written, the only difference is that qgroups
are also removed. How is that incompatible on the same level as other
features like BGT?
Qu Wenruo May 3, 2024, 10:14 p.m. UTC | #20
在 2024/5/3 22:16, David Sterba 写道:
> On Fri, May 03, 2024 at 06:59:03AM +0930, Qu Wenruo wrote:
>>>> The problem is as mentioned already, it's not persistent, thus it needs
>>>> a user space daemon to set it for every fs after mount.
>>>
>>> My idea of using sysfs is to export the information that the
>>> autocleaning feature is present and if we make it on by default then
>>> there's no need for additional step to enable it. The feedback about
>>> that was that it should have been default so we're going to make that
>>> change, but with sysfs export also provide a fallback to disable it in
>>> case it breaks things for somebody.
>>>
>>>> I'm totally fine to go sysfs for now, but I really hope to a persistent
>>>> solution.
>>>> Maybe a dedicated config tree?
>>>
>>> No, we already have a way to store data in the trees or in the
>>> properties so no new tree.
>>
>> That means a on-disk format change.
>> IIRC everytime we introduce a new TEMP objectid, it should at least be
>> compat or compat_ro bit change.
>>
>> Or older kernel won't understand nor follow the new TEMP key.
>
> This is not necessarily a problem, if older kernel does not understand
> the key it'll skip it. The search treats the keys as numbers. If there's
> something more associated with the additional data that would prevent
> mount then a compat bit would make sense.
>
> The compat_ro does not make sense here, the same filesystem can be
> mounted and data read or written, the only difference is that qgroups
> are also removed. How is that incompatible on the same level as other
> features like BGT?

That's why BGT is compat_ro, not compat.
And exactly why I want to push a new compat bit just for qgroup auto reap.

Thanks,
Qu
diff mbox series

Patch

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 023920d0d971..1e2caa234146 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5834,6 +5834,7 @@  int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 	struct btrfs_root_item *root_item = &root->root_item;
 	struct walk_control *wc;
 	struct btrfs_key key;
+	const u64 rootid = btrfs_root_id(root);
 	int err = 0;
 	int ret;
 	int level;
@@ -6064,6 +6065,13 @@  int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 	kfree(wc);
 	btrfs_free_path(path);
 out:
+	if (!err && root_dropped) {
+		ret = btrfs_qgroup_cleanup_dropped_subvolume(fs_info, rootid);
+		if (ret < 0)
+			btrfs_warn_rl(fs_info,
+				      "failed to cleanup qgroup 0/%llu: %d",
+				      rootid, ret);
+	}
 	/*
 	 * We were an unfinished drop root, check to see if there are any
 	 * pending, and if not clear and wake up any waiters.
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 2ea16a07a7d4..9aeb740388ab 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1859,6 +1859,39 @@  int btrfs_remove_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid)
 	return ret;
 }
 
+int btrfs_qgroup_cleanup_dropped_subvolume(struct btrfs_fs_info *fs_info,
+					   u64 subvolid)
+{
+	struct btrfs_trans_handle *trans;
+	int ret;
+
+	if (!is_fstree(subvolid) || !btrfs_qgroup_enabled(fs_info) ||
+	    !fs_info->quota_root)
+		return 0;
+
+	/*
+	 * If our qgroup is consistent, commit current transaction to make sure
+	 * all the rfer/excl numbers get updated to 0 before deleting.
+	 */
+	if (!(fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT)) {
+		trans = btrfs_start_transaction(fs_info->quota_root, 0);
+		if (IS_ERR(trans))
+			return PTR_ERR(trans);
+
+		ret = btrfs_commit_transaction(trans);
+		if (ret < 0)
+			return ret;
+	}
+
+	/* Start new trans to delete the qgroup info and limit items. */
+	trans = btrfs_start_transaction(fs_info->quota_root, 2);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+	ret = btrfs_remove_qgroup(trans, subvolid);
+	btrfs_end_transaction(trans);
+	return ret;
+}
+
 int btrfs_limit_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid,
 		       struct btrfs_qgroup_limit *limit)
 {
@@ -4877,7 +4910,11 @@  int btrfs_record_squota_delta(struct btrfs_fs_info *fs_info,
 	spin_lock(&fs_info->qgroup_lock);
 	qgroup = find_qgroup_rb(fs_info, root);
 	if (!qgroup) {
-		ret = -ENOENT;
+		/*
+		 * The qgroup can be auto deleted by subvolume deleting.
+		 * In that case do not consider it an error.
+		 */
+		ret = 0;
 		goto out;
 	}
 
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 706640be0ec2..3f93856a02e1 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -327,6 +327,8 @@  int btrfs_del_qgroup_relation(struct btrfs_trans_handle *trans, u64 src,
 			      u64 dst);
 int btrfs_create_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid);
 int btrfs_remove_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid);
+int btrfs_qgroup_cleanup_dropped_subvolume(struct btrfs_fs_info *fs_info,
+					   u64 subvolid);
 int btrfs_limit_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid,
 		       struct btrfs_qgroup_limit *limit);
 int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info);