Message ID | cover.1690495785.git.boris@bur.io (mailing list archive) |
---|---|
Series | btrfs: simple quotas |
On Thu, Jul 27, 2023 at 03:12:47PM -0700, Boris Burkov wrote:
> btrfs quota groups (qgroups) are a compelling feature of btrfs that
> allow flexible control for limiting subvolume data and metadata usage.
> However, due to btrfs's high-level decision to trade off snapshot
> performance against ref-counting performance, qgroups suffer from
> non-trivial performance issues that make them unattractive in certain
> workloads. Particularly, frequent backref walking during writes and
> during commits can make operations increasingly expensive as the number
> of snapshots scales up. For that reason, we have never been able to
> commit to using qgroups in production at Meta, despite significant
> interest from people running container workloads, where we would benefit
> from protecting the rest of the host from a buggy application in a
> container running away with disk usage.
>
> This patch series introduces a simplified version of qgroups called
> simple quotas (squotas) which never computes global reference counts
> for extents, and thus has similar performance characteristics to normal,
> quotas disabled, btrfs. The "trick" is that in simple quotas mode, we
> account all extents permanently to the subvolume in which they were
> originally created. That allows us to make all accounting 1:1 with
> extent item lifetime, removing the need to walk backrefs. However,
> this sacrifices the ability to compute shared vs. exclusive usage. It
> also results in counter-intuitive, though still predictable and simple
> accounting in the cases where an original extent is removed while a
> shared copy still exists. Qgroups is able to detect that case and count
> the remaining copy as an exclusive owner, while squotas is not. As a
> result, squotas works best when the original extent is immutable and
> outlives any clones.
>
> ==Format Change==
> In order to track the original creating subvolume of a data extent in
> the face of reflinks, it is necessary to add additional accounting to
> the extent item. To save space, this is done with a new inline ref item.
> However, the downside of this approach is that it makes enabling squota
> an incompat change, denoted by the new incompat bit SIMPLE_QUOTA. When
> this bit is set and quotas are enabled, new extent items get the extra
> accounting, and freed extent items check for the accounting to find
> their creating subvolume. In addition, 1:1 with this incompat bit,
> the quota status item now tracks a "quota enablement generation" needed
> for properly handling deleting extents that predate enablement.
>
> ==API==
> Squotas reuses the API of qgroups.

So apart from the accounting, the hierarchy of qgroups can still be
built as before, right? In the example you create a group 1/100 so I
assume that it's still qgroups from the outside, and that the limits can
be set.

Because if not, then squotas would make more sense as a separate
infrastructure, under quotas. That is, quotas would be the abstraction
while qgroups or squotas would be the implementation.

> The only difference is that when you
> enable quotas via `btrfs quota enable`, you pass the `--simple` flag.
> Squotas will always report exclusive == shared for each qgroup. Squotas
> deal with extent_item/metadata_item sizes and thus do not do anything
> special with compression. Squotas also introduce auto inheritance for
> nested subvols. The API is documented more fully in the documentation
> patches in btrfs-progs.

The lack of exclusive size sharing will be confusing I guess, so we need
to make it clear in the documentation and in the UI that it's either
full or simple mode.

I've added the patchset to for-next, we may need an iteration or two to
fix some issues I've seen so far but on the fundamental level I think
it's ok.

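The accounting rule in the cover letter (charge every extent permanently to the subvolume that created it, and credit it back only when the extent item itself is freed) can be illustrated with a toy model. This is a sketch for intuition only, not kernel code: the class and method names are hypothetical, and tracking references as a set of subvolume names is a simplification.

```python
from dataclasses import dataclass, field


@dataclass
class Extent:
    owner: str   # subvolume that created the extent (the "owner ref")
    size: int    # bytes charged to the owner
    refs: set = field(default_factory=set)  # subvolumes still referencing it


class SimpleQuotaModel:
    """Toy model: an extent stays charged to its creator for its whole life."""

    def __init__(self):
        self.extents = {}
        self.usage = {}
        self._next_id = 0

    def create_extent(self, subvol, size):
        extent_id = self._next_id
        self._next_id += 1
        self.extents[extent_id] = Extent(owner=subvol, size=size, refs={subvol})
        self.usage[subvol] = self.usage.get(subvol, 0) + size
        return extent_id

    def reflink(self, extent_id, subvol):
        # Sharing never changes the accounting: no backref walk, no recharge.
        self.extents[extent_id].refs.add(subvol)
        self.usage.setdefault(subvol, 0)

    def drop_ref(self, extent_id, subvol):
        extent = self.extents[extent_id]
        extent.refs.discard(subvol)
        if not extent.refs:
            # The owner is credited only when the last reference goes away.
            self.usage[extent.owner] -= extent.size
            del self.extents[extent_id]


model = SimpleQuotaModel()
eid = model.create_extent("subvol_a", 4096)
model.reflink(eid, "subvol_b")
model.drop_ref(eid, "subvol_a")   # original removed, clone still exists
print(model.usage)                # subvol_a is still charged, subvol_b is not
model.drop_ref(eid, "subvol_b")   # last reference dropped
print(model.usage)
```

The middle state is the counter-intuitive case from the cover letter: `subvol_a` no longer references the data but still pays for it, while `subvol_b` holds the only copy at no charge.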
On Thu, Sep 07, 2023 at 12:51:15PM +0200, David Sterba wrote:
> On Thu, Jul 27, 2023 at 03:12:47PM -0700, Boris Burkov wrote:
> > btrfs quota groups (qgroups) are a compelling feature of btrfs that
> > allow flexible control for limiting subvolume data and metadata usage.
> > However, due to btrfs's high-level decision to trade off snapshot
> > performance against ref-counting performance, qgroups suffer from
> > non-trivial performance issues that make them unattractive in certain
> > workloads. Particularly, frequent backref walking during writes and
> > during commits can make operations increasingly expensive as the number
> > of snapshots scales up. For that reason, we have never been able to
> > commit to using qgroups in production at Meta, despite significant
> > interest from people running container workloads, where we would benefit
> > from protecting the rest of the host from a buggy application in a
> > container running away with disk usage.
> >
> > This patch series introduces a simplified version of qgroups called
> > simple quotas (squotas) which never computes global reference counts
> > for extents, and thus has similar performance characteristics to normal,
> > quotas disabled, btrfs. The "trick" is that in simple quotas mode, we
> > account all extents permanently to the subvolume in which they were
> > originally created. That allows us to make all accounting 1:1 with
> > extent item lifetime, removing the need to walk backrefs. However,
> > this sacrifices the ability to compute shared vs. exclusive usage. It
> > also results in counter-intuitive, though still predictable and simple
> > accounting in the cases where an original extent is removed while a
> > shared copy still exists. Qgroups is able to detect that case and count
> > the remaining copy as an exclusive owner, while squotas is not. As a
> > result, squotas works best when the original extent is immutable and
> > outlives any clones.
> >
> > ==Format Change==
> > In order to track the original creating subvolume of a data extent in
> > the face of reflinks, it is necessary to add additional accounting to
> > the extent item. To save space, this is done with a new inline ref item.
> > However, the downside of this approach is that it makes enabling squota
> > an incompat change, denoted by the new incompat bit SIMPLE_QUOTA. When
> > this bit is set and quotas are enabled, new extent items get the extra
> > accounting, and freed extent items check for the accounting to find
> > their creating subvolume. In addition, 1:1 with this incompat bit,
> > the quota status item now tracks a "quota enablement generation" needed
> > for properly handling deleting extents that predate enablement.
> >
> > ==API==
> > Squotas reuses the API of qgroups.
>
> So apart from the accounting, the hierarchy of qgroups can still be
> built as before, right? In the example you create a group 1/100 so I
> assume that it's still qgroups from the outside, and that the limits can
> be set.

Yes, you can create quota group hierarchies with the same nesting
behavior. I am only changing the accounting methodology (and added auto
hierarchy).

> Because if not, then squotas would make more sense as a separate
> infrastructure, under quotas. That is, quotas would be the abstraction
> while qgroups or squotas would be the implementation.
>
> > The only difference is that when you
> > enable quotas via `btrfs quota enable`, you pass the `--simple` flag.
> > Squotas will always report exclusive == shared for each qgroup. Squotas
> > deal with extent_item/metadata_item sizes and thus do not do anything
> > special with compression. Squotas also introduce auto inheritance for
> > nested subvols. The API is documented more fully in the documentation
> > patches in btrfs-progs.
>
> The lack of exclusive size sharing will be confusing I guess, so we need
> to make it clear in the documentation and in the UI that it's either
> full or simple mode.

I am happy to iterate on that. I think always reporting shared=0 would
also be defensible, since the *ownership* is exclusive. I opted for
making them equal since it is sort of both shared usage (we don't know
if it's shared or when it will be freed) and exclusive usage (belongs
to this subvol by owner ref).

> I've added the patchset to for-next, we may need an iteration or two to
> fix some issues I've seen so far but on the fundamental level I think
> it's ok.

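The "quota enablement generation" from the quoted Format Change section can be pictured with a second toy model. The reasoning assumed here is that an extent allocated before simple quotas were enabled was never charged to anyone, so freeing it must not credit any subvolume. The names, the strict `<` comparison, and the choice to bump the generation when enabling are all assumptions made for illustration, not the kernel's actual logic.

```python
class EnableGenModel:
    """Toy model: skip accounting for extents older than quota enablement."""

    def __init__(self):
        self.generation = 1      # stand-in for the transaction generation
        self.enable_gen = None   # generation at which squotas were enabled
        self.usage = {}          # subvolume -> charged bytes
        self.extents = {}        # extent id -> (owner, size, creation generation)

    def commit(self):
        self.generation += 1

    def enable(self):
        # Assumption: enabling starts a fresh generation, so every extent
        # allocated earlier has a strictly smaller creation generation.
        self.generation += 1
        self.enable_gen = self.generation

    def alloc(self, extent_id, owner, size):
        self.extents[extent_id] = (owner, size, self.generation)
        if self.enable_gen is not None:
            self.usage[owner] = self.usage.get(owner, 0) + size

    def free(self, extent_id):
        owner, size, gen = self.extents.pop(extent_id)
        if self.enable_gen is None or gen < self.enable_gen:
            return  # predates enablement: never charged, nothing to credit
        self.usage[owner] -= size


fs = EnableGenModel()
fs.alloc("old", "subvol_a", 8192)   # allocated while quotas are off
fs.enable()
fs.alloc("new", "subvol_a", 4096)   # charged to subvol_a
fs.free("old")                      # no credit, usage is unchanged
print(fs.usage)
```

Without the generation check, freeing the pre-enablement extent would subtract bytes that were never added and drive the counter negative.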
On Thu, Sep 07, 2023 at 01:51:31PM -0700, Boris Burkov wrote:
> On Thu, Sep 07, 2023 at 12:51:15PM +0200, David Sterba wrote:
> > On Thu, Jul 27, 2023 at 03:12:47PM -0700, Boris Burkov wrote:
> > > btrfs quota groups (qgroups) are a compelling feature of btrfs that
> > > allow flexible control for limiting subvolume data and metadata usage.
> > > However, due to btrfs's high-level decision to trade off snapshot
> > > performance against ref-counting performance, qgroups suffer from
> > > non-trivial performance issues that make them unattractive in certain
> > > workloads. Particularly, frequent backref walking during writes and
> > > during commits can make operations increasingly expensive as the number
> > > of snapshots scales up. For that reason, we have never been able to
> > > commit to using qgroups in production at Meta, despite significant
> > > interest from people running container workloads, where we would benefit
> > > from protecting the rest of the host from a buggy application in a
> > > container running away with disk usage.
> > >
> > > This patch series introduces a simplified version of qgroups called
> > > simple quotas (squotas) which never computes global reference counts
> > > for extents, and thus has similar performance characteristics to normal,
> > > quotas disabled, btrfs. The "trick" is that in simple quotas mode, we
> > > account all extents permanently to the subvolume in which they were
> > > originally created. That allows us to make all accounting 1:1 with
> > > extent item lifetime, removing the need to walk backrefs. However,
> > > this sacrifices the ability to compute shared vs. exclusive usage. It
> > > also results in counter-intuitive, though still predictable and simple
> > > accounting in the cases where an original extent is removed while a
> > > shared copy still exists. Qgroups is able to detect that case and count
> > > the remaining copy as an exclusive owner, while squotas is not. As a
> > > result, squotas works best when the original extent is immutable and
> > > outlives any clones.
> > >
> > > ==Format Change==
> > > In order to track the original creating subvolume of a data extent in
> > > the face of reflinks, it is necessary to add additional accounting to
> > > the extent item. To save space, this is done with a new inline ref item.
> > > However, the downside of this approach is that it makes enabling squota
> > > an incompat change, denoted by the new incompat bit SIMPLE_QUOTA. When
> > > this bit is set and quotas are enabled, new extent items get the extra
> > > accounting, and freed extent items check for the accounting to find
> > > their creating subvolume. In addition, 1:1 with this incompat bit,
> > > the quota status item now tracks a "quota enablement generation" needed
> > > for properly handling deleting extents that predate enablement.
> > >
> > > ==API==
> > > Squotas reuses the API of qgroups.
> >
> > So apart from the accounting, the hierarchy of qgroups can still be
> > built as before, right? In the example you create a group 1/100 so I
> > assume that it's still qgroups from the outside, and that the limits can
> > be set.
>
> Yes, you can create quota group hierarchies with the same nesting
> behavior. I am only changing the accounting methodology (and added auto
> hierarchy).

OK, makes sense. The hierarchy does not need to be used and is probably
less practical for the simple accounting.

What I had in mind was some kind of flat hierarchy, now that the simple
accounting is there. People were asking about that in the past, with the
drawback of lack of shared/exclusive accounting. Adding separate
subcommands and tooling around flat quotas could be done, but it can be
done with squotas as well: just "don't use the hierarchy".

> > Because if not, then squotas would make more sense as a separate
> > infrastructure, under quotas. That is, quotas would be the abstraction
> > while qgroups or squotas would be the implementation.
> >
> > > The only difference is that when you
> > > enable quotas via `btrfs quota enable`, you pass the `--simple` flag.
> > > Squotas will always report exclusive == shared for each qgroup. Squotas
> > > deal with extent_item/metadata_item sizes and thus do not do anything
> > > special with compression. Squotas also introduce auto inheritance for
> > > nested subvols. The API is documented more fully in the documentation
> > > patches in btrfs-progs.
> >
> > The lack of exclusive size sharing will be confusing I guess, so we need
> > to make it clear in the documentation and in the UI that it's either
> > full or simple mode.
>
> I am happy to iterate on that. I think always reporting shared=0 would
> also be defensible, since the *ownership* is exclusive. I opted for
> making them equal since it is sort of both shared usage (we don't know
> if it's shared or when it will be freed) and exclusive usage (belongs
> to this subvol by owner ref).

I agree with that reasoning.

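The difference between a nested setup (such as the 1/100 group mentioned above) and a flat "don't use the hierarchy" setup can be sketched with a toy qgroup tree. It assumes that a charge against a subvolume's level-0 qgroup propagates to every ancestor and is refused if any limit on the way would be exceeded. The class, method, and exception names are hypothetical; only the `level/id` naming follows btrfs convention.

```python
class QuotaExceeded(Exception):
    pass


class QgroupTree:
    """Toy model: charges propagate from a qgroup to all of its ancestors."""

    def __init__(self):
        self.parents = {}   # qgroup id -> list of parent qgroup ids
        self.usage = {}     # qgroup id -> charged bytes
        self.limits = {}    # qgroup id -> byte limit (absent means unlimited)

    def create(self, qgroupid, limit=None):
        self.parents.setdefault(qgroupid, [])
        self.usage.setdefault(qgroupid, 0)
        if limit is not None:
            self.limits[qgroupid] = limit

    def assign(self, child, parent):
        self.parents[child].append(parent)

    def _chain(self, qgroupid):
        # The qgroup itself plus every ancestor, each visited once.
        seen, stack = [], [qgroupid]
        while stack:
            q = stack.pop()
            if q not in seen:
                seen.append(q)
                stack.extend(self.parents[q])
        return seen

    def charge(self, qgroupid, size):
        chain = self._chain(qgroupid)
        for q in chain:  # check every limit before changing any counter
            limit = self.limits.get(q)
            if limit is not None and self.usage[q] + size > limit:
                raise QuotaExceeded(q)
        for q in chain:
            self.usage[q] += size


# Flat use: one limit per subvolume qgroup, no parents at all.
flat = QgroupTree()
flat.create("0/256", limit=8192)
flat.charge("0/256", 4096)

# Nested use: two subvolumes share the limit of a parent group.
nested = QgroupTree()
nested.create("1/100", limit=10000)
nested.create("0/257")
nested.create("0/258")
nested.assign("0/257", "1/100")
nested.assign("0/258", "1/100")
nested.charge("0/257", 6000)
print(flat.usage, nested.usage)
```

In the flat case the tree degenerates to independent per-subvolume counters, which is why no separate tooling is strictly needed for it.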
On Thu, Sep 07, 2023 at 12:51:15PM +0200, David Sterba wrote:
> On Thu, Jul 27, 2023 at 03:12:47PM -0700, Boris Burkov wrote:
> I've added the patchset to for-next,

There's a merge conflict due to Filipe's delayed refs changes, "btrfs:
record simple quota deltas" new parameter to __btrfs_free_extent,
run_delayed_data_ref and maybe others. I may resolve that for for-next
but this could duplicate work if you do that too, so I can wait for a
resend with other things updated.