Message ID | 07e54de6747a5bf1e6a422cba80cbb06ba832cf4.1713519718.git.wqu@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | btrfs: qgroup: stale qgroups related impromvents | expand |
On Fri, Apr 19, 2024 at 07:16:53PM +0930, Qu Wenruo wrote: > Currently if we fully removed a subvolume (not only unlinked, but fully > dropped its root item), its qgroup would not be removed. > > Thus we have "btrfs qgroup clear-stale" to handle such 0 level qgroups. There's also an option 'btrfs subvolume delete --delete-qgroup' that does that and is going to be default in 6.9. With this kernel change it would break the behaviour of the --no-delete-qgroup, which is there for the case something depends on that. For now I'd rather postpone changing the kernel behaviour.
在 2024/4/24 22:11, David Sterba 写道: > On Fri, Apr 19, 2024 at 07:16:53PM +0930, Qu Wenruo wrote: >> Currently if we fully removed a subvolume (not only unlinked, but fully >> dropped its root item), its qgroup would not be removed. >> >> Thus we have "btrfs qgroup clear-stale" to handle such 0 level qgroups. > > There's also an option 'btrfs subvolume delete --delete-qgroup' that > does that and is going to be default in 6.9. With this kernel change it > would break the behaviour of the --no-delete-qgroup, which is there for > the case something depends on that. For now I'd rather postpone > changing the kernel behaviour. > A quick glance of the --delete-qgroup shows it won't work as expected at all. Firstly, the qgroup delete requires the qgroup numbers to be 0. Meanwhile qgroup numbers can only be 0 after 1) the full subvolume has been dropped 2) a transaction is committed to reflect the qgroup numbers. Both situation is only handled in my patchset, thus this means for a lot of cases it won't work at all. Furthermore, there is the drop_subtree_threshold thing, which can mark qgroup inconsistent and skip accounting, making the target subvolume's qgroup numbers never fall back to 0 (until next rescan). So I'm afraid the --delete-qgroup won't work until the 1/2 patch get merged (allowing deleting qgroups as long as the target subvolume is gone). Thanks, Qu
On Thu, Apr 25, 2024 at 07:49:12AM +0930, Qu Wenruo wrote: > > > 在 2024/4/24 22:11, David Sterba 写道: > > On Fri, Apr 19, 2024 at 07:16:53PM +0930, Qu Wenruo wrote: > >> Currently if we fully removed a subvolume (not only unlinked, but fully > >> dropped its root item), its qgroup would not be removed. > >> > >> Thus we have "btrfs qgroup clear-stale" to handle such 0 level qgroups. > > > > There's also an option 'btrfs subvolume delete --delete-qgroup' that > > does that and is going to be default in 6.9. With this kernel change it > > would break the behaviour of the --no-delete-qgroup, which is there for > > the case something depends on that. For now I'd rather postpone > > changing the kernel behaviour. > > > > A quick glance of the --delete-qgroup shows it won't work as expected at > all. > > Firstly, the qgroup delete requires the qgroup numbers to be 0. > Meanwhile qgroup numbers can only be 0 after 1) the full subvolume has > been dropped 2) a transaction is committed to reflect the qgroup numbers. The deletion option calls ioctl, so this means that 'btrfs qgroup remove' will not delete it either? > Both situation is only handled in my patchset, thus this means for a lot > of cases it won't work at all. > > Furthermore, there is the drop_subtree_threshold thing, which can mark > qgroup inconsistent and skip accounting, making the target subvolume's > qgroup numbers never fall back to 0 (until next rescan). > > So I'm afraid the --delete-qgroup won't work until the 1/2 patch get > merged (allowing deleting qgroups as long as the target subvolume is gone). Ok, so for emulation of the complete removal in userspace it's btrfs subvolume delete 123 btrfs subvolume sync 123 btrfs qgroup remove 0/123 but this needs to wait until the sync is finished and that is not expected for the subvolume delete command. It needs to be fixed but now I'm not sure this can be default in 6.9 as planned.
在 2024/4/25 22:04, David Sterba 写道: > On Thu, Apr 25, 2024 at 07:49:12AM +0930, Qu Wenruo wrote: >> >> >> 在 2024/4/24 22:11, David Sterba 写道: >>> On Fri, Apr 19, 2024 at 07:16:53PM +0930, Qu Wenruo wrote: >>>> Currently if we fully removed a subvolume (not only unlinked, but fully >>>> dropped its root item), its qgroup would not be removed. >>>> >>>> Thus we have "btrfs qgroup clear-stale" to handle such 0 level qgroups. >>> >>> There's also an option 'btrfs subvolume delete --delete-qgroup' that >>> does that and is going to be default in 6.9. With this kernel change it >>> would break the behaviour of the --no-delete-qgroup, which is there for >>> the case something depends on that. For now I'd rather postpone >>> changing the kernel behaviour. >>> >> >> A quick glance of the --delete-qgroup shows it won't work as expected at >> all. >> >> Firstly, the qgroup delete requires the qgroup numbers to be 0. >> Meanwhile qgroup numbers can only be 0 after 1) the full subvolume has >> been dropped 2) a transaction is committed to reflect the qgroup numbers. > > The deletion option calls ioctl, so this means that 'btrfs qgroup remove' > will not delete it either? Nope, at least if the subvolume is not cleaned up immediately. > >> Both situation is only handled in my patchset, thus this means for a lot >> of cases it won't work at all. >> >> Furthermore, there is the drop_subtree_threshold thing, which can mark >> qgroup inconsistent and skip accounting, making the target subvolume's >> qgroup numbers never fall back to 0 (until next rescan). >> >> So I'm afraid the --delete-qgroup won't work until the 1/2 patch get >> merged (allowing deleting qgroups as long as the target subvolume is gone). > > Ok, so for emulation of the complete removal in userspace it's > > btrfs subvolume delete 123 > btrfs subvolume sync 123 > btrfs qgroup remove 0/123 > > but this needs to wait until the sync is finished and that is not > expected for the subvolume delete command. That's the problem, and why doing it in user space has it limits. Furthermore, with drop_subtree_threshold or other qgroup operations marking the qgroup inconsistent, you can not delete that qgroup at all, until the next rescan. > It needs to be fixed but now > I'm not sure this can be default in 6.9 as planned. I'd say, you should not implement this feature without really understanding the challenges in the first place. And that's why I really prefer you send out non-trivial btrfs-progs for review, other than pushing them directly into github repo. Thanks, Qu
On Fri, Apr 19, 2024 at 07:16:53PM +0930, Qu Wenruo wrote: > Currently if we fully removed a subvolume (not only unlinked, but fully > dropped its root item), its qgroup would not be removed. > > Thus we have "btrfs qgroup clear-stale" to handle such 0 level qgroups. > > This patch changes the behavior by automatically removing the qgroup of > a fully dropped subvolume. > > There is an exception for simple quota, that btrfs_record_squota_delta() > has to handle missing qgroup case, where the target delta belongs to an > automatically deleted subvolume. > In that case since the subvolume is already gone, no need to treat it as > an error. > > Link: https://bugzilla.suse.com/show_bug.cgi?id=1222847 > Signed-off-by: Qu Wenruo <wqu@suse.com> > --- > fs/btrfs/extent-tree.c | 8 ++++++++ > fs/btrfs/qgroup.c | 39 ++++++++++++++++++++++++++++++++++++++- > fs/btrfs/qgroup.h | 2 ++ > 3 files changed, 48 insertions(+), 1 deletion(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 023920d0d971..1e2caa234146 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -5834,6 +5834,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc) > struct btrfs_root_item *root_item = &root->root_item; > struct walk_control *wc; > struct btrfs_key key; > + const u64 rootid = btrfs_root_id(root); > int err = 0; > int ret; > int level; > @@ -6064,6 +6065,13 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc) > kfree(wc); > btrfs_free_path(path); > out: > + if (!err && root_dropped) { > + ret = btrfs_qgroup_cleanup_dropped_subvolume(fs_info, rootid); > + if (ret < 0) > + btrfs_warn_rl(fs_info, > + "failed to cleanup qgroup 0/%llu: %d", > + rootid, ret); > + } > /* > * We were an unfinished drop root, check to see if there are any > * pending, and if not clear and wake up any waiters. > diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c > index 2ea16a07a7d4..9aeb740388ab 100644 > --- a/fs/btrfs/qgroup.c > +++ b/fs/btrfs/qgroup.c > @@ -1859,6 +1859,39 @@ int btrfs_remove_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid) > return ret; > } > > +int btrfs_qgroup_cleanup_dropped_subvolume(struct btrfs_fs_info *fs_info, > + u64 subvolid) > +{ > + struct btrfs_trans_handle *trans; > + int ret; > + > + if (!is_fstree(subvolid) || !btrfs_qgroup_enabled(fs_info) || > + !fs_info->quota_root) > + return 0; > + > + /* > + * If our qgroup is consistent, commit current transaction to make sure > + * all the rfer/excl numbers get updated to 0 before deleting. > + */ > + if (!(fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT)) { > + trans = btrfs_start_transaction(fs_info->quota_root, 0); > + if (IS_ERR(trans)) > + return PTR_ERR(trans); > + > + ret = btrfs_commit_transaction(trans); > + if (ret < 0) > + return ret; > + } > + > + /* Start new trans to delete the qgroup info and limit items. */ > + trans = btrfs_start_transaction(fs_info->quota_root, 2); > + if (IS_ERR(trans)) > + return PTR_ERR(trans); > + ret = btrfs_remove_qgroup(trans, subvolid); > + btrfs_end_transaction(trans); > + return ret; > +} > + > int btrfs_limit_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid, > struct btrfs_qgroup_limit *limit) > { > @@ -4877,7 +4910,11 @@ int btrfs_record_squota_delta(struct btrfs_fs_info *fs_info, > spin_lock(&fs_info->qgroup_lock); > qgroup = find_qgroup_rb(fs_info, root); > if (!qgroup) { > - ret = -ENOENT; > + /* > + * The qgroup can be auto deleted by subvolume deleting. > + * In that case do not consider it an error. > + */ > + ret = 0; If we call record_squota_delta, that means we are adding an extent to the tree whose owner ref will point at this subvol. Which means that we are entering into a case where our on disk accounting disagrees between qgroup and extent tree. This should fail the squota fsck logic, if you really trigger it. How were you hitting this? In some mundane way where the extent actually gets dropped? I worry that if the extent also picks up a reference in the same transaction in a way that the owner is the dropped subvol, we could get into inconsistency. Ideally, the checks on rsv/usage in drop also prevent that.. > goto out; > } > > diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h > index 706640be0ec2..3f93856a02e1 100644 > --- a/fs/btrfs/qgroup.h > +++ b/fs/btrfs/qgroup.h > @@ -327,6 +327,8 @@ int btrfs_del_qgroup_relation(struct btrfs_trans_handle *trans, u64 src, > u64 dst); > int btrfs_create_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid); > int btrfs_remove_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid); > +int btrfs_qgroup_cleanup_dropped_subvolume(struct btrfs_fs_info *fs_info, > + u64 subvolid); > int btrfs_limit_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid, > struct btrfs_qgroup_limit *limit); > int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info); > -- > 2.44.0 >
On Fri, Apr 26, 2024 at 07:21:08AM +0930, Qu Wenruo wrote: > > > 在 2024/4/25 22:04, David Sterba 写道: > > On Thu, Apr 25, 2024 at 07:49:12AM +0930, Qu Wenruo wrote: > > > > > > > > > 在 2024/4/24 22:11, David Sterba 写道: > > > > On Fri, Apr 19, 2024 at 07:16:53PM +0930, Qu Wenruo wrote: > > > > > Currently if we fully removed a subvolume (not only unlinked, but fully > > > > > dropped its root item), its qgroup would not be removed. > > > > > > > > > > Thus we have "btrfs qgroup clear-stale" to handle such 0 level qgroups. > > > > > > > > There's also an option 'btrfs subvolume delete --delete-qgroup' that > > > > does that and is going to be default in 6.9. With this kernel change it > > > > would break the behaviour of the --no-delete-qgroup, which is there for > > > > the case something depends on that. For now I'd rather postpone > > > > changing the kernel behaviour. > > > > > > > > > > A quick glance of the --delete-qgroup shows it won't work as expected at > > > all. > > > > > > Firstly, the qgroup delete requires the qgroup numbers to be 0. > > > Meanwhile qgroup numbers can only be 0 after 1) the full subvolume has > > > been dropped 2) a transaction is committed to reflect the qgroup numbers. > > > > The deletion option calls ioctl, so this means that 'btrfs qgroup remove' > > will not delete it either? > > Nope, at least if the subvolume is not cleaned up immediately. > > > > > Both situation is only handled in my patchset, thus this means for a lot > > > of cases it won't work at all. > > > > > > Furthermore, there is the drop_subtree_threshold thing, which can mark > > > qgroup inconsistent and skip accounting, making the target subvolume's > > > qgroup numbers never fall back to 0 (until next rescan). > > > > > > So I'm afraid the --delete-qgroup won't work until the 1/2 patch get > > > merged (allowing deleting qgroups as long as the target subvolume is gone). > > > > Ok, so for emulation of the complete removal in userspace it's > > > > btrfs subvolume delete 123 > > btrfs subvolume sync 123 > > btrfs qgroup remove 0/123 > > > > but this needs to wait until the sync is finished and that is not > > expected for the subvolume delete command. > > That's the problem, and why doing it in user space has it limits. > > Furthermore, with drop_subtree_threshold or other qgroup operations > marking the qgroup inconsistent, you can not delete that qgroup at all, > until the next rescan. > > > It needs to be fixed but now > > I'm not sure this can be default in 6.9 as planned. > > I'd say, you should not implement this feature without really > understanding the challenges in the first place. > > And that's why I really prefer you send out non-trivial btrfs-progs for > review, other than pushing them directly into github repo. > > Thanks, > Qu I support the auto deletion in the kernel as you propose, I think it just makes sense. Who wants stale, empty qgroups around that aren't attached to any subvol? I suppose that with the drop_thresh thing, it is possible some parent qgroup still reflects the usage until the next full scan? Thinking out loud -- for regular qgroups, we could avoid this all if we do the reaping when usage goes to 0 and there is no subvol. So remove the qgroup as a consequence of the rescan, not the subvol delete. I imagine this is quite a bit messier, though :( We could also just not auto-reap if that condition occurs (inconsistent qg with a parent), but I think that may be surprising for the same reasons that got you working on this in the first place... Do we know of an explicit need to support --no-delete-qgroup? It feels like it is perfectly normal for us to improve the default behavior of the kernel or userspace tools without supporting the old behavior as a flag forever (without a user). Put another way, I think it would be perfectly reasonable to term the stale qgroups a leaked memory resource and this patch a bug fix, if we are willing to get overly philosophical about it. We don't carry around eternal flags for bug fixes, unless it's some rather exceptional case. If we are on the hook for supporing that flag because we already added it to progs and don't want to deprecate it, then maybe we can think of something compatible with default auto-reap. Thanks, Boris
On Mon, Apr 29, 2024 at 06:13:33AM -0700, Boris Burkov wrote: > I support the auto deletion in the kernel as you propose, I think it > just makes sense. Who wants stale, empty qgroups around that aren't > attached to any subvol? I suppose that with the drop_thresh thing, it is > possible some parent qgroup still reflects the usage until the next full > scan? The stale qgroups have been out for a long time so removing them after subvolume deletion is changing default behaviour, this always breaks somebody's scripts or tools. > Thinking out loud -- for regular qgroups, we could avoid this all if we > do the reaping when usage goes to 0 and there is no subvol. So remove > the qgroup as a consequence of the rescan, not the subvol delete. I > imagine this is quite a bit messier, though :( > > We could also just not auto-reap if that condition occurs (inconsistent > qg with a parent), but I think that may be surprising for the same > reasons that got you working on this in the first place... > > Do we know of an explicit need to support --no-delete-qgroup? It feels > like it is perfectly normal for us to improve the default behavior of > the kernel or userspace tools without supporting the old behavior as a > flag forever (without a user). $ id=$(btrfs inspect rootid subvol) $ btrfs subvolume delete subvol $ btrfs qgroup remove 0/$id 1/1 . <---- fails $ btrfs qgroup destroy 0/$id . <---- fails > Put another way, I think it would be perfectly reasonable to term the > stale qgroups a leaked memory resource and this patch a bug fix, if we > are willing to get overly philosophical about it. We don't carry around > eternal flags for bug fixes, unless it's some rather exceptional case. The command line option does not do what I expected, if somebody would have to update the scripts to add it then we can do the kernel auto-remove and the document it. Eventually we can translate the -ENOENT error code to be ignored. > If we are on the hook for supporing that flag because we already added > it to progs and don't want to deprecate it, then maybe we can think of > something compatible with default auto-reap.
On Fri, Apr 26, 2024 at 07:21:08AM +0930, Qu Wenruo wrote: > > btrfs subvolume delete 123 > > btrfs subvolume sync 123 > > btrfs qgroup remove 0/123 > > > > but this needs to wait until the sync is finished and that is not > > expected for the subvolume delete command. > > That's the problem, and why doing it in user space has it limits. > > Furthermore, with drop_subtree_threshold or other qgroup operations > marking the qgroup inconsistent, you can not delete that qgroup at all, > until the next rescan. Ok so that makes it more complicated and better solved in kernel, with the compatibility issues mentioned in the other mail. > > It needs to be fixed but now > > I'm not sure this can be default in 6.9 as planned. > > I'd say, you should not implement this feature without really > understanding the challenges in the first place. > > And that's why I really prefer you send out non-trivial btrfs-progs for > review, other than pushing them directly into github repo. As discussed on slack, the development happens in git, for btrfs-progs the mails are overhead that does not pay off.
在 2024/4/30 02:01, David Sterba 写道: > On Mon, Apr 29, 2024 at 06:13:33AM -0700, Boris Burkov wrote: >> I support the auto deletion in the kernel as you propose, I think it >> just makes sense. Who wants stale, empty qgroups around that aren't >> attached to any subvol? I suppose that with the drop_thresh thing, it is >> possible some parent qgroup still reflects the usage until the next full >> scan? > > The stale qgroups have been out for a long time so removing them after > subvolume deletion is changing default behaviour, this always breaks > somebody's scripts or tools. If needed I can introduce a compat bit (the first one), to tell the behavior difference. And if we go the compat bit way, I believe it can be the first example on how to do a kernel behavior change correctly without breaking any user space assumption. Thanks, Qu > >> Thinking out loud -- for regular qgroups, we could avoid this all if we >> do the reaping when usage goes to 0 and there is no subvol. So remove >> the qgroup as a consequence of the rescan, not the subvol delete. I >> imagine this is quite a bit messier, though :( >> >> We could also just not auto-reap if that condition occurs (inconsistent >> qg with a parent), but I think that may be surprising for the same >> reasons that got you working on this in the first place... >> >> Do we know of an explicit need to support --no-delete-qgroup? It feels >> like it is perfectly normal for us to improve the default behavior of >> the kernel or userspace tools without supporting the old behavior as a >> flag forever (without a user). > > $ id=$(btrfs inspect rootid subvol) > $ btrfs subvolume delete subvol > $ btrfs qgroup remove 0/$id 1/1 . <---- fails > $ btrfs qgroup destroy 0/$id . <---- fails > >> Put another way, I think it would be perfectly reasonable to term the >> stale qgroups a leaked memory resource and this patch a bug fix, if we >> are willing to get overly philosophical about it. We don't carry around >> eternal flags for bug fixes, unless it's some rather exceptional case. > > The command line option does not do what I expected, if somebody would > have to update the scripts to add it then we can do the kernel > auto-remove and the document it. Eventually we can translate the -ENOENT > error code to be ignored. > >> If we are on the hook for supporing that flag because we already added >> it to progs and don't want to deprecate it, then maybe we can think of >> something compatible with default auto-reap.
在 2024/4/29 22:43, Boris Burkov 写道: > On Fri, Apr 26, 2024 at 07:21:08AM +0930, Qu Wenruo wrote: [...] >> I'd say, you should not implement this feature without really >> understanding the challenges in the first place. >> >> And that's why I really prefer you send out non-trivial btrfs-progs for >> review, other than pushing them directly into github repo. >> >> Thanks, >> Qu > > I support the auto deletion in the kernel as you propose, I think it > just makes sense. Who wants stale, empty qgroups around that aren't > attached to any subvol? I suppose that with the drop_thresh thing, it is > possible some parent qgroup still reflects the usage until the next full > scan? Yep, that's the problem. Although I can also do the extra dropping for regular qgroups, but since it's already inconsistent, modifying the numbers won't make much difference anyway. > > Thinking out loud -- for regular qgroups, we could avoid this all if we > do the reaping when usage goes to 0 and there is no subvol. So remove > the qgroup as a consequence of the rescan, not the subvol delete. I > imagine this is quite a bit messier, though :( This may be a little more complex, we need to do the extra check at qgroup accounting time. And IIRC there may be some corner case that numbers of a qgroup dropped to 0 temporarily, but later extents accounting would increase it back. Another concern is related to rescan. Rescan would mark all qgroup to have 0 numbers first, then let rescan to refill them all. This means for a fully dropped subvolume in INCONSISTENT mode, there is no other timing other than rescan setting its numbers to 0. So dropping the qgroup when its number goes back to 0 won't happen for it. > > We could also just not auto-reap if that condition occurs (inconsistent > qg with a parent), but I think that may be surprising for the same > reasons that got you working on this in the first place... > > Do we know of an explicit need to support --no-delete-qgroup? It feels > like it is perfectly normal for us to improve the default behavior of > the kernel or userspace tools without supporting the old behavior as a > flag forever (without a user). I do not think any one would intentionally prefer a stale one. But considering if squota requires a stale qgroup to keep the accounting correct, the difference between regular qgroup and squota may grow larger and larger in the future. I'll need to re-consider the qgroup drop condition to make it only possible to delete really stale qgroups (all zero and no subvolume for both regular qgroup/squota, but for inconsistent regular qgroup, allowing dropping non-zero qgroups). > > Put another way, I think it would be perfectly reasonable to term the > stale qgroups a leaked memory resource and this patch a bug fix, if we > are willing to get overly philosophical about it. We don't carry around > eternal flags for bug fixes, unless it's some rather exceptional case. > > If we are on the hook for supporing that flag because we already added > it to progs and don't want to deprecate it, then maybe we can think of > something compatible with default auto-reap. Like a compat bit to represent if we do the auto-reap. This would definitely make older script happy. But I think it's a little overkilled. Thanks, Qu > > Thanks, > Boris >
On Tue, Apr 30, 2024 at 07:35:11AM +0930, Qu Wenruo wrote: > > > 在 2024/4/30 02:01, David Sterba 写道: > > On Mon, Apr 29, 2024 at 06:13:33AM -0700, Boris Burkov wrote: > >> I support the auto deletion in the kernel as you propose, I think it > >> just makes sense. Who wants stale, empty qgroups around that aren't > >> attached to any subvol? I suppose that with the drop_thresh thing, it is > >> possible some parent qgroup still reflects the usage until the next full > >> scan? > > > > The stale qgroups have been out for a long time so removing them after > > subvolume deletion is changing default behaviour, this always breaks > > somebody's scripts or tools. > > If needed I can introduce a compat bit (the first one), to tell the > behavior difference. > > And if we go the compat bit way, I believe it can be the first example > on how to do a kernel behavior change correctly without breaking any > user space assumption. I don't see how a compat bit would work here, we use them for feature compatibility and for general access to data (full or read-only). What we do with individual behavioral changes are sysfs files. They're detectable by scripts and can be also used for configuration. In this case enabling/disabling autoclean of the qgroups.
在 2024/4/30 20:29, David Sterba 写道: > On Tue, Apr 30, 2024 at 07:35:11AM +0930, Qu Wenruo wrote: >> >> >> 在 2024/4/30 02:01, David Sterba 写道: >>> On Mon, Apr 29, 2024 at 06:13:33AM -0700, Boris Burkov wrote: >>>> I support the auto deletion in the kernel as you propose, I think it >>>> just makes sense. Who wants stale, empty qgroups around that aren't >>>> attached to any subvol? I suppose that with the drop_thresh thing, it is >>>> possible some parent qgroup still reflects the usage until the next full >>>> scan? >>> >>> The stale qgroups have been out for a long time so removing them after >>> subvolume deletion is changing default behaviour, this always breaks >>> somebody's scripts or tools. >> >> If needed I can introduce a compat bit (the first one), to tell the >> behavior difference. >> >> And if we go the compat bit way, I believe it can be the first example >> on how to do a kernel behavior change correctly without breaking any >> user space assumption. > > I don't see how a compat bit would work here, we use them for feature > compatibility and for general access to data (full or read-only). What > we do with individual behavioral changes are sysfs files. They're > detectable by scripts and can be also used for configuration. In this > case enabling/disabling autoclean of the qgroups. > I mean the compat bit, which is fully empty now. The new bit would be something like BTRFS_QGROUP_AUTO_REMOVAL, with that set, btrfs would auto remove any stale qgroups (for regular qgroups though). Without that, it would be as usual (no auto removal). Since this doesn't cause any on-disk change, it does not needs compat-ro nor incompat. Thanks, Qu
On Wed, May 01, 2024 at 07:35:09AM +0930, Qu Wenruo wrote: > > > 在 2024/4/30 20:29, David Sterba 写道: > > On Tue, Apr 30, 2024 at 07:35:11AM +0930, Qu Wenruo wrote: > > > > > > > > > 在 2024/4/30 02:01, David Sterba 写道: > > > > On Mon, Apr 29, 2024 at 06:13:33AM -0700, Boris Burkov wrote: > > > > > I support the auto deletion in the kernel as you propose, I think it > > > > > just makes sense. Who wants stale, empty qgroups around that aren't > > > > > attached to any subvol? I suppose that with the drop_thresh thing, it is > > > > > possible some parent qgroup still reflects the usage until the next full > > > > > scan? > > > > > > > > The stale qgroups have been out for a long time so removing them after > > > > subvolume deletion is changing default behaviour, this always breaks > > > > somebody's scripts or tools. > > > > > > If needed I can introduce a compat bit (the first one), to tell the > > > behavior difference. > > > > > > And if we go the compat bit way, I believe it can be the first example > > > on how to do a kernel behavior change correctly without breaking any > > > user space assumption. > > > > I don't see how a compat bit would work here, we use them for feature > > compatibility and for general access to data (full or read-only). What > > we do with individual behavioral changes are sysfs files. They're > > detectable by scripts and can be also used for configuration. In this > > case enabling/disabling autoclean of the qgroups. This was my initial thought too, but your compat bit idea is interesting since it persists? I vote sysfs since it has good infrastructure/momentum already for similar config. > > > > I mean the compat bit, which is fully empty now. > > The new bit would be something like BTRFS_QGROUP_AUTO_REMOVAL, with that > set, btrfs would auto remove any stale qgroups (for regular qgroups though). > > Without that, it would be as usual (no auto removal). > > Since this doesn't cause any on-disk change, it does not needs compat-ro > nor incompat. > > Thanks, > Qu
在 2024/5/1 07:48, Boris Burkov 写道: > On Wed, May 01, 2024 at 07:35:09AM +0930, Qu Wenruo wrote: [...] >>> >>> I don't see how a compat bit would work here, we use them for feature >>> compatibility and for general access to data (full or read-only). What >>> we do with individual behavioral changes are sysfs files. They're >>> detectable by scripts and can be also used for configuration. In this >>> case enabling/disabling autoclean of the qgroups. > > This was my initial thought too, but your compat bit idea is interesting > since it persists? I vote sysfs since it has good > infrastructure/momentum already for similar config. Sysfs is another way which we are already utilizing for qgroups, like drop_subtree_threshold. The problem is as mentioned already, it's not persistent, thus it needs a user space daemon to set it for every fs after mount. I'm totally fine to go sysfs for now, but I really hope to a persistent solution. Maybe a dedicated config tree? Thanks, Qu > >>> >> >> I mean the compat bit, which is fully empty now. >> >> The new bit would be something like BTRFS_QGROUP_AUTO_REMOVAL, with that >> set, btrfs would auto remove any stale qgroups (for regular qgroups though). >> >> Without that, it would be as usual (no auto removal). >> >> Since this doesn't cause any on-disk change, it does not needs compat-ro >> nor incompat. >> >> Thanks, >> Qu
On Wed, May 01, 2024 at 07:35:09AM +0930, Qu Wenruo wrote: > > > 在 2024/4/30 20:29, David Sterba 写道: > > On Tue, Apr 30, 2024 at 07:35:11AM +0930, Qu Wenruo wrote: > >> > >> > >> 在 2024/4/30 02:01, David Sterba 写道: > >>> On Mon, Apr 29, 2024 at 06:13:33AM -0700, Boris Burkov wrote: > >>>> I support the auto deletion in the kernel as you propose, I think it > >>>> just makes sense. Who wants stale, empty qgroups around that aren't > >>>> attached to any subvol? I suppose that with the drop_thresh thing, it is > >>>> possible some parent qgroup still reflects the usage until the next full > >>>> scan? > >>> > >>> The stale qgroups have been out for a long time so removing them after > >>> subvolume deletion is changing default behaviour, this always breaks > >>> somebody's scripts or tools. > >> > >> If needed I can introduce a compat bit (the first one), to tell the > >> behavior difference. > >> > >> And if we go the compat bit way, I believe it can be the first example > >> on how to do a kernel behavior change correctly without breaking any > >> user space assumption. > > > > I don't see how a compat bit would work here, we use them for feature > > compatibility and for general access to data (full or read-only). What > > we do with individual behavioral changes are sysfs files. They're > > detectable by scripts and can be also used for configuration. In this > > case enabling/disabling autoclean of the qgroups. > > I mean the compat bit, which is fully empty now. Which compat bit and in which structure? What I mean is the super block incompat/compat bits but what you described below does not match it. > The new bit would be something like BTRFS_QGROUP_AUTO_REMOVAL, with that > set, btrfs would auto remove any stale qgroups (for regular qgroups though). > > Without that, it would be as usual (no auto removal).
On Wed, May 01, 2024 at 07:57:08AM +0930, Qu Wenruo wrote: > > > 在 2024/5/1 07:48, Boris Burkov 写道: > > On Wed, May 01, 2024 at 07:35:09AM +0930, Qu Wenruo wrote: > [...] > >>> > >>> I don't see how a compat bit would work here, we use them for feature > >>> compatibility and for general access to data (full or read-only). What > >>> we do with individual behavioral changes are sysfs files. They're > >>> detectable by scripts and can be also used for configuration. In this > >>> case enabling/disabling autoclean of the qgroups. > > > > This was my initial thought too, but your compat bit idea is interesting > > since it persists? I vote sysfs since it has good > > infrastructure/momentum already for similar config. > > Sysfs is another way which we are already utilizing for qgroups, like > drop_subtree_threshold. > > The problem is as mentioned already, it's not persistent, thus it needs > a user space daemon to set it for every fs after mount. My idea of using sysfs is to export the information that the autocleaning feature is present and if we make it on by default then there's no need for additional step to enable it. The feedback about that was that it should have been default so we're going to make that change, but with sysfs export also provide a fallback to disable it in case it breaks things for somebody. > I'm totally fine to go sysfs for now, but I really hope to a persistent > solution. > Maybe a dedicated config tree? No, we already have a way to store data in the trees or in the properties so no new tree.
在 2024/5/3 00:30, David Sterba 写道: > On Wed, May 01, 2024 at 07:35:09AM +0930, Qu Wenruo wrote: >> >> >> 在 2024/4/30 20:29, David Sterba 写道: >>> On Tue, Apr 30, 2024 at 07:35:11AM +0930, Qu Wenruo wrote: >>>> >>>> >>>> 在 2024/4/30 02:01, David Sterba 写道: >>>>> On Mon, Apr 29, 2024 at 06:13:33AM -0700, Boris Burkov wrote: >>>>>> I support the auto deletion in the kernel as you propose, I think it >>>>>> just makes sense. Who wants stale, empty qgroups around that aren't >>>>>> attached to any subvol? I suppose that with the drop_thresh thing, it is >>>>>> possible some parent qgroup still reflects the usage until the next full >>>>>> scan? >>>>> >>>>> The stale qgroups have been out for a long time so removing them after >>>>> subvolume deletion is changing default behaviour, this always breaks >>>>> somebody's scripts or tools. >>>> >>>> If needed I can introduce a compat bit (the first one), to tell the >>>> behavior difference. >>>> >>>> And if we go the compat bit way, I believe it can be the first example >>>> on how to do a kernel behavior change correctly without breaking any >>>> user space assumption. >>> >>> I don't see how a compat bit would work here, we use them for feature >>> compatibility and for general access to data (full or read-only). What >>> we do with individual behavioral changes are sysfs files. They're >>> detectable by scripts and can be also used for configuration. In this >>> case enabling/disabling autoclean of the qgroups. >> >> I mean the compat bit, which is fully empty now. > > Which compat bit and in which structure? What I mean is the super block > incompat/compat bits but what you described below does not match it. Superblock compat bit exactly. I didn't see why my following description doesn't match it. We have 3 compat bits: - incompat Huge on-disk format change that older kernel can not understand and affects older kernels' ability to read files. This is the most abused bits from the past. A lot of features like skinny metadata should go compat_ro. - compat_ro On-disk format changes, but older kernels should still be able to read the fs/csum tree etc. - compat Not utilized for now. Older kernel can still do everything without any problem. Thus this should be the best candidate for runtime behavior changes that doesn't affect on-disk format at all. Thanks, Qu > >> The new bit would be something like BTRFS_QGROUP_AUTO_REMOVAL, with that >> set, btrfs would auto remove any stale qgroups (for regular qgroups though). >> >> Without that, it would be as usual (no auto removal).
在 2024/5/3 00:33, David Sterba 写道: > On Wed, May 01, 2024 at 07:57:08AM +0930, Qu Wenruo wrote: >> >> >> 在 2024/5/1 07:48, Boris Burkov 写道: >>> On Wed, May 01, 2024 at 07:35:09AM +0930, Qu Wenruo wrote: >> [...] >>>>> >>>>> I don't see how a compat bit would work here, we use them for feature >>>>> compatibility and for general access to data (full or read-only). What >>>>> we do with individual behavioral changes are sysfs files. They're >>>>> detectable by scripts and can be also used for configuration. In this >>>>> case enabling/disabling autoclean of the qgroups. >>> >>> This was my initial thought too, but your compat bit idea is interesting >>> since it persists? I vote sysfs since it has good >>> infrastructure/momentum already for similar config. >> >> Sysfs is another way which we are already utilizing for qgroups, like >> drop_subtree_threshold. >> >> The problem is as mentioned already, it's not persistent, thus it needs >> a user space daemon to set it for every fs after mount. > > My idea of using sysfs is to export the information that the > autocleaning feature is present and if we make it on by default then > there's no need for additional step to enable it. The feedback about > that was that it should have been default so we're going to make that > change, but with sysfs export also provide a fallback to disable it in > case it breaks things for somebody. > >> I'm totally fine to go sysfs for now, but I really hope to a persistent >> solution. >> Maybe a dedicated config tree? > > No, we already have a way to store data in the trees or in the > properties so no new tree. That means a on-disk format change. IIRC everytime we introduce a new TEMP objectid, it should at least be compat or compat_ro bit change. Or older kernel won't understand nor follow the new TEMP key. Thanks, Qu
On Fri, May 03, 2024 at 06:59:03AM +0930, Qu Wenruo wrote: > >> The problem is as mentioned already, it's not persistent, thus it needs > >> a user space daemon to set it for every fs after mount. > > > > My idea of using sysfs is to export the information that the > > autocleaning feature is present and if we make it on by default then > > there's no need for additional step to enable it. The feedback about > > that was that it should have been default so we're going to make that > > change, but with sysfs export also provide a fallback to disable it in > > case it breaks things for somebody. > > > >> I'm totally fine to go sysfs for now, but I really hope to a persistent > >> solution. > >> Maybe a dedicated config tree? > > > > No, we already have a way to store data in the trees or in the > > properties so no new tree. > > That means a on-disk format change. > IIRC everytime we introduce a new TEMP objectid, it should at least be > compat or compat_ro bit change. > > Or older kernel won't understand nor follow the new TEMP key. This is not necessarily a problem, if older kernel does not understand the key it'll skip it. The search treats the keys as numbers. If there's something more associated with the additional data that would prevent mount then a compat bit would make sense. The compat_ro does not make sense here, the same filesystem can be mounted and data read or written, the only difference is that qgroups are also removed. How is that incompatible on the same level as other features like BGT?
在 2024/5/3 22:16, David Sterba 写道: > On Fri, May 03, 2024 at 06:59:03AM +0930, Qu Wenruo wrote: >>>> The problem is as mentioned already, it's not persistent, thus it needs >>>> a user space daemon to set it for every fs after mount. >>> >>> My idea of using sysfs is to export the information that the >>> autocleaning feature is present and if we make it on by default then >>> there's no need for additional step to enable it. The feedback about >>> that was that it should have been default so we're going to make that >>> change, but with sysfs export also provide a fallback to disable it in >>> case it breaks things for somebody. >>> >>>> I'm totally fine to go sysfs for now, but I really hope to a persistent >>>> solution. >>>> Maybe a dedicated config tree? >>> >>> No, we already have a way to store data in the trees or in the >>> properties so no new tree. >> >> That means a on-disk format change. >> IIRC everytime we introduce a new TEMP objectid, it should at least be >> compat or compat_ro bit change. >> >> Or older kernel won't understand nor follow the new TEMP key. > > This is not necessarily a problem, if older kernel does not understand > the key it'll skip it. The search treats the keys as numbers. If there's > something more associated with the additional data that would prevent > mount then a compat bit would make sense. > > The compat_ro does not make sense here, the same filesystem can be > mounted and data read or written, the only difference is that qgroups > are also removed. How is that incompatible on the same level as other > features like BGT? That's why BGT is compat_ro, not compat. And exactly why I want to push a new compat bit just for qgroup auto reap. Thanks, Qu
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 023920d0d971..1e2caa234146 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -5834,6 +5834,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc) struct btrfs_root_item *root_item = &root->root_item; struct walk_control *wc; struct btrfs_key key; + const u64 rootid = btrfs_root_id(root); int err = 0; int ret; int level; @@ -6064,6 +6065,13 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc) kfree(wc); btrfs_free_path(path); out: + if (!err && root_dropped) { + ret = btrfs_qgroup_cleanup_dropped_subvolume(fs_info, rootid); + if (ret < 0) + btrfs_warn_rl(fs_info, + "failed to cleanup qgroup 0/%llu: %d", + rootid, ret); + } /* * We were an unfinished drop root, check to see if there are any * pending, and if not clear and wake up any waiters. diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c index 2ea16a07a7d4..9aeb740388ab 100644 --- a/fs/btrfs/qgroup.c +++ b/fs/btrfs/qgroup.c @@ -1859,6 +1859,39 @@ int btrfs_remove_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid) return ret; } +int btrfs_qgroup_cleanup_dropped_subvolume(struct btrfs_fs_info *fs_info, + u64 subvolid) +{ + struct btrfs_trans_handle *trans; + int ret; + + if (!is_fstree(subvolid) || !btrfs_qgroup_enabled(fs_info) || + !fs_info->quota_root) + return 0; + + /* + * If our qgroup is consistent, commit current transaction to make sure + * all the rfer/excl numbers get updated to 0 before deleting. + */ + if (!(fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT)) { + trans = btrfs_start_transaction(fs_info->quota_root, 0); + if (IS_ERR(trans)) + return PTR_ERR(trans); + + ret = btrfs_commit_transaction(trans); + if (ret < 0) + return ret; + } + + /* Start new trans to delete the qgroup info and limit items. */ + trans = btrfs_start_transaction(fs_info->quota_root, 2); + if (IS_ERR(trans)) + return PTR_ERR(trans); + ret = btrfs_remove_qgroup(trans, subvolid); + btrfs_end_transaction(trans); + return ret; +} + int btrfs_limit_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid, struct btrfs_qgroup_limit *limit) { @@ -4877,7 +4910,11 @@ int btrfs_record_squota_delta(struct btrfs_fs_info *fs_info, spin_lock(&fs_info->qgroup_lock); qgroup = find_qgroup_rb(fs_info, root); if (!qgroup) { - ret = -ENOENT; + /* + * The qgroup can be auto deleted by subvolume deleting. + * In that case do not consider it an error. + */ + ret = 0; goto out; } diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h index 706640be0ec2..3f93856a02e1 100644 --- a/fs/btrfs/qgroup.h +++ b/fs/btrfs/qgroup.h @@ -327,6 +327,8 @@ int btrfs_del_qgroup_relation(struct btrfs_trans_handle *trans, u64 src, u64 dst); int btrfs_create_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid); int btrfs_remove_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid); +int btrfs_qgroup_cleanup_dropped_subvolume(struct btrfs_fs_info *fs_info, + u64 subvolid); int btrfs_limit_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid, struct btrfs_qgroup_limit *limit); int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info);
Currently if we fully removed a subvolume (not only unlinked, but fully dropped its root item), its qgroup would not be removed. Thus we have "btrfs qgroup clear-stale" to handle such 0 level qgroups. This patch changes the behavior by automatically removing the qgroup of a fully dropped subvolume. There is an exception for simple quota, that btrfs_record_squota_delta() has to handle missing qgroup case, where the target delta belongs to an automatically deleted subvolume. In that case since the subvolume is already gone, no need to treat it as an error. Link: https://bugzilla.suse.com/show_bug.cgi?id=1222847 Signed-off-by: Qu Wenruo <wqu@suse.com> --- fs/btrfs/extent-tree.c | 8 ++++++++ fs/btrfs/qgroup.c | 39 ++++++++++++++++++++++++++++++++++++++- fs/btrfs/qgroup.h | 2 ++ 3 files changed, 48 insertions(+), 1 deletion(-)