Message ID | ad77c4035370a5db4e0b813da9eff73e52fd30c0.1694174371.git.fdmanana@suse.com (mailing list archive)
State      | New, archived
Series     | btrfs: updates to delayed refs accounting and space reservation
On Fri, Sep 08, 2023 at 01:09:03PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> If we have two (or more) tasks attempting to refill the delayed refs block
> reserve we can end up with the delayed block reserve being over reserved,
> that is, with a reserved space greater than its size. If this happens, we
> are holding to more reserved space than necessary, however it may or may
> not be harmless. In case the delayed refs block reserve size later
> increases to match or go beyond the reserved space, then nothing bad
> happens, we end up simply avoiding doing a space reservation later at the
> expense of doing it now. However if the delayed refs block reserve size
> never increases to match its reserved size, then at unmount time we will
> trigger the following warning from btrfs_release_global_block_rsv():
>
>   WARN_ON(fs_info->delayed_block_rsv.reserved > 0);
>
> The race happens like this:
>
> 1) The delayed refs block reserve has a size of 8M and a reserved space of
>    6M for example;
>
> 2) Task A calls btrfs_delayed_refs_rsv_refill();
>
> 3) Task B also calls btrfs_delayed_refs_rsv_refill();
>
> 4) Task A sees there's a 2M difference between the size and the reserved
>    space of the delayed refs rsv, so it will reserve 2M of space by
>    calling btrfs_reserve_metadata_bytes();
>
> 5) Task B also sees that 2M difference, and like task A, it reserves
>    another 2M of metadata space;
>
> 6) Both task A and task B increase the reserved space of block reserve
>    by 2M, by calling btrfs_block_rsv_add_bytes(), so the block reserve
>    ends up with a size of 8M and a reserved space of 10M;
>
> 7) As delayed refs are run and their space released, the delayed refs
>    block reserve ends up with a size of 0 and a reserved space of 2M;
>

This part shouldn't happen, delayed refs space only manipulates
delayed_refs_rsv->size, so when we drop their space, we do

btrfs_delayed_refs_rsv_release
  -> btrfs_block_rsv_release
    -> block_rsv_release_bytes

which does

        spin_lock(&block_rsv->lock);
        if (num_bytes == (u64)-1) {
                num_bytes = block_rsv->size;
                qgroup_to_release = block_rsv->qgroup_rsv_size;
        }
        block_rsv->size -= num_bytes;
        if (block_rsv->reserved >= block_rsv->size) {
                num_bytes = block_rsv->reserved - block_rsv->size;
                block_rsv->reserved = block_rsv->size;
                block_rsv->full = true;
        }

so if A and B race and the reservation is now much larger than size, the extra
will be cut off at the end when we call btrfs_delayed_refs_rsv_release.

Now I'm all for not holding the over-reserve for that long, but this analysis is
incorrect, we would have to somehow do the btrfs_delayed_refs_rsv_refill() and
then not call btrfs_delayed_refs_rsv_release() at some point.

This *could* happen I suppose by starting a trans handle and not modifying
anything while the delayed refs are run and finish before we do our reserve, and
in that case we didn't generate a delayed ref that could later on call
btrfs_delayed_refs_rsv_release() to clear the excess reservation, but that's a
decidedly different problem with a different solution.  Thanks,

Josef
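To make the interleaving above concrete, here is a minimal userspace model of the
pre-patch refill path and of the clipping done in block_rsv_release_bytes() that
Josef quotes. struct rsv, refill_calc(), refill_add() and release() are simplified
stand-ins invented for this sketch (not btrfs APIs); only the size/reserved
bookkeeping is modeled, the limit clamp and the spinlock are omitted.

#include <stdio.h>
#include <stdint.h>

struct rsv {
        uint64_t size;
        uint64_t reserved;
};

/* Pre-patch refill, step 1: compute how much this task thinks is missing. */
static uint64_t refill_calc(const struct rsv *r)
{
        return r->size > r->reserved ? r->size - r->reserved : 0;
}

/* Pre-patch refill, step 2: blindly add whatever was reserved outside the lock. */
static void refill_add(struct rsv *r, uint64_t bytes)
{
        r->reserved += bytes;
}

/* Release path: drop 'bytes' of size and clip reserved down to the new size. */
static void release(struct rsv *r, uint64_t bytes)
{
        r->size -= bytes;
        if (r->reserved >= r->size)
                r->reserved = r->size;  /* the excess goes back to the space_info */
}

int main(void)
{
        struct rsv r = { .size = 8, .reserved = 6 };    /* MiB, as in the example */

        /* Both tasks sample the 2M deficit before either one adds its bytes. */
        uint64_t need_a = refill_calc(&r);
        uint64_t need_b = refill_calc(&r);

        refill_add(&r, need_a);
        refill_add(&r, need_b);
        printf("after race:    size=%lluM reserved=%lluM\n",
               (unsigned long long)r.size, (unsigned long long)r.reserved);

        /* Running the delayed refs releases their size; the clip trims the extra. */
        release(&r, 8);
        printf("after release: size=%lluM reserved=%lluM\n",
               (unsigned long long)r.size, (unsigned long long)r.reserved);
        return 0;
}

Run as-is this prints a reserved space of 10M right after the race and 0 after the
release, matching the point that the over-reserve is cut off once
btrfs_delayed_refs_rsv_release() eventually runs.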
On Fri, Sep 8, 2023 at 3:46 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Fri, Sep 08, 2023 at 01:09:03PM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > If we have two (or more) tasks attempting to refill the delayed refs block
> > reserve we can end up with the delayed block reserve being over reserved,
> > that is, with a reserved space greater than its size. If this happens, we
> > are holding to more reserved space than necessary, however it may or may
> > not be harmless. In case the delayed refs block reserve size later
> > increases to match or go beyond the reserved space, then nothing bad
> > happens, we end up simply avoiding doing a space reservation later at the
> > expense of doing it now. However if the delayed refs block reserve size
> > never increases to match its reserved size, then at unmount time we will
> > trigger the following warning from btrfs_release_global_block_rsv():
> >
> >   WARN_ON(fs_info->delayed_block_rsv.reserved > 0);
> >
> > The race happens like this:
> >
> > 1) The delayed refs block reserve has a size of 8M and a reserved space of
> >    6M for example;
> >
> > 2) Task A calls btrfs_delayed_refs_rsv_refill();
> >
> > 3) Task B also calls btrfs_delayed_refs_rsv_refill();
> >
> > 4) Task A sees there's a 2M difference between the size and the reserved
> >    space of the delayed refs rsv, so it will reserve 2M of space by
> >    calling btrfs_reserve_metadata_bytes();
> >
> > 5) Task B also sees that 2M difference, and like task A, it reserves
> >    another 2M of metadata space;
> >
> > 6) Both task A and task B increase the reserved space of block reserve
> >    by 2M, by calling btrfs_block_rsv_add_bytes(), so the block reserve
> >    ends up with a size of 8M and a reserved space of 10M;
> >
> > 7) As delayed refs are run and their space released, the delayed refs
> >    block reserve ends up with a size of 0 and a reserved space of 2M;
> >
>
> This part shouldn't happen, delayed refs space only manipulates
> delayed_refs_rsv->size, so when we drop their space, we do
>
> btrfs_delayed_refs_rsv_release
>   -> btrfs_block_rsv_release
>     -> block_rsv_release_bytes
>
> which does
>
>         spin_lock(&block_rsv->lock);
>         if (num_bytes == (u64)-1) {
>                 num_bytes = block_rsv->size;
>                 qgroup_to_release = block_rsv->qgroup_rsv_size;
>         }
>         block_rsv->size -= num_bytes;
>         if (block_rsv->reserved >= block_rsv->size) {
>                 num_bytes = block_rsv->reserved - block_rsv->size;
>                 block_rsv->reserved = block_rsv->size;
>                 block_rsv->full = true;
>         }
>
> so if A and B race and the reservation is now much larger than size, the extra
> will be cut off at the end when we call btrfs_delayed_refs_rsv_release.
>
> Now I'm all for not holding the over-reserve for that long, but this analysis is
> incorrect, we would have to somehow do the btrfs_delayed_refs_rsv_refill() and
> then not call btrfs_delayed_refs_rsv_release() at some point.
>
> This *could* happen I suppose by starting a trans handle and not modifying
> anything while the delayed refs are run and finish before we do our reserve, and
> in that case we didn't generate a delayed ref that could later on call
> btrfs_delayed_refs_rsv_release() to clear the excess reservation, but that's a
> decidedly different problem with a different solution.  Thanks,

Oh yes, let me fix that.
Holding the extra space for longer than necessary was actually the thing I
noticed first.

Fixed in v2 thanks.

>
> Josef
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 6a13cf00218b..1043f66cc130 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -163,6 +163,8 @@ int btrfs_delayed_refs_rsv_refill(struct btrfs_fs_info *fs_info,
 	struct btrfs_block_rsv *block_rsv = &fs_info->delayed_refs_rsv;
 	u64 limit = btrfs_calc_delayed_ref_bytes(fs_info, 1);
 	u64 num_bytes = 0;
+	u64 refilled_bytes;
+	u64 to_free;
 	int ret = -ENOSPC;
 
 	spin_lock(&block_rsv->lock);
@@ -178,9 +180,38 @@ int btrfs_delayed_refs_rsv_refill(struct btrfs_fs_info *fs_info,
 	ret = btrfs_reserve_metadata_bytes(fs_info, block_rsv, num_bytes, flush);
 	if (ret)
 		return ret;
-	btrfs_block_rsv_add_bytes(block_rsv, num_bytes, false);
-	trace_btrfs_space_reservation(fs_info, "delayed_refs_rsv",
-				      0, num_bytes, 1);
+
+	/*
+	 * We may have raced with someone else, so check again if the block
+	 * reserve is still not full and release any excess space.
+	 */
+	spin_lock(&block_rsv->lock);
+	if (block_rsv->reserved < block_rsv->size) {
+		u64 needed = block_rsv->size - block_rsv->reserved;
+
+		if (num_bytes >= needed) {
+			block_rsv->reserved += needed;
+			block_rsv->full = true;
+			to_free = num_bytes - needed;
+			refilled_bytes = needed;
+		} else {
+			block_rsv->reserved += num_bytes;
+			to_free = 0;
+			refilled_bytes = num_bytes;
+		}
+	} else {
+		to_free = num_bytes;
+		refilled_bytes = 0;
+	}
+	spin_unlock(&block_rsv->lock);
+
+	if (to_free > 0)
+		btrfs_space_info_free_bytes_may_use(fs_info, block_rsv->space_info,
+						    to_free);
+
+	if (refilled_bytes > 0)
+		trace_btrfs_space_reservation(fs_info, "delayed_refs_rsv", 0,
+					      refilled_bytes, 1);
 	return 0;
 }
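The core of the fix is the re-check the patch adds under block_rsv->lock after
btrfs_reserve_metadata_bytes() returns: keep only what is still missing and hand
the rest back via btrfs_space_info_free_bytes_may_use(). Below is a minimal
sketch of just that commit step, using the same invented userspace stand-ins as
the earlier model (struct rsv and refill_commit() are illustrative names, not
the kernel API; the lock is omitted).

#include <stdio.h>
#include <stdint.h>

struct rsv {
        uint64_t size;
        uint64_t reserved;
};

/*
 * Model of the post-reservation re-check: add only what is still needed and
 * return the excess so the caller can free it back to the space_info.
 * In the kernel this runs under block_rsv->lock.
 */
static uint64_t refill_commit(struct rsv *r, uint64_t reserved_bytes)
{
        uint64_t to_free;

        if (r->reserved < r->size) {
                uint64_t needed = r->size - r->reserved;

                if (reserved_bytes >= needed) {
                        r->reserved += needed;          /* reserve is now full */
                        to_free = reserved_bytes - needed;
                } else {
                        r->reserved += reserved_bytes;
                        to_free = 0;
                }
        } else {
                to_free = reserved_bytes;       /* another task already refilled it */
        }
        return to_free;
}

int main(void)
{
        struct rsv r = { .size = 8, .reserved = 6 };    /* MiB, as in the example */

        /* Both racing tasks reserved 2M outside the lock; commit them in turn. */
        uint64_t free_a = refill_commit(&r, 2);
        uint64_t free_b = refill_commit(&r, 2);

        printf("reserved=%lluM (size=%lluM), freed A=%lluM B=%lluM\n",
               (unsigned long long)r.reserved, (unsigned long long)r.size,
               (unsigned long long)free_a, (unsigned long long)free_b);
        return 0;
}

With this step, whichever racing task commits second frees its entire
reservation (the sketch prints freed B=2M), so the reserved space of the delayed
refs block reserve can no longer exceed its size.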