| Message ID | 20241009153032.23336-2-jth@kernel.org (mailing list archive) |
|---|---|
| State | New, archived |
| Series | implement truncation for RAID stripe-extents |
On 09.10.24 17:32, Johannes Thumshirn wrote:
> @@ -43,9 +91,8 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
>                         break;
>                 if (ret > 0) {
>                         ret = 0;
> -                       if (path->slots[0] == 0)
> -                               break;
> -                       path->slots[0]--;
> +                       if (path->slots[0] > 0)
> +                               path->slots[0]--;
>                 }
>
>                 leaf = path->nodes[0];

That part is wrong, will send a v4 shortly.
On Wed, Oct 9, 2024 at 5:00 PM Johannes Thumshirn <jth@kernel.org> wrote:
>
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> In our CI system, the RAID stripe tree configuration sometimes fails with
> the following ASSERT():
>
>   assertion failed: found_start >= start && found_end <= end, in fs/btrfs/raid-stripe-tree.c:64
>
> This ASSERT() triggers because, for the initial design of RAID
> stripe-tree, I had the "one ordered-extent equals one bio" rule of zoned
> btrfs in mind.
>
> But for a RAID stripe-tree based system that is not hosted on a zoned
> storage device but on a regular device, this rule doesn't apply.
>
> So in case the range we want to delete starts in the middle of the
> previous item, grab the item and "truncate" its length. That is, clone
> the item, subtract the deleted portion from the key's offset, delete the
> old item and insert the new one.
>
> In case the range to delete ends in the middle of an item, we have to
> adjust both the item's key as well as the stripe extents and then
> re-insert the modified clone into the tree after deleting the old stripe
> extent.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/raid-stripe-tree.c | 85 +++++++++++++++++++++++++++++++++++--
>  1 file changed, 81 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> index 41970bbdb05f..40cc0a392be2 100644
> --- a/fs/btrfs/raid-stripe-tree.c
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -13,6 +13,54 @@
>  #include "volumes.h"
>  #include "print-tree.h"
>
> +static int btrfs_partially_delete_raid_extent(struct btrfs_trans_handle *trans,
> +                                             struct btrfs_path *path,
> +                                             struct btrfs_key *oldkey,
> +                                             u64 newlen, u64 frontpad)
> +{
> +       struct btrfs_root *stripe_root = trans->fs_info->stripe_root;
> +       struct btrfs_stripe_extent *extent, *new;
> +       struct extent_buffer *leaf = path->nodes[0];
> +       int slot = path->slots[0];
> +       const size_t item_size = btrfs_item_size(leaf, slot);
> +       struct btrfs_key newkey;
> +       int ret;
> +       int i;
> +
> +       new = kzalloc(item_size, GFP_NOFS);
> +       if (!new)
> +               return -ENOMEM;
> +
> +       memcpy(&newkey, oldkey, sizeof(struct btrfs_key));
> +       newkey.objectid += frontpad;
> +       newkey.offset -= newlen;
> +
> +       extent = btrfs_item_ptr(leaf, slot, struct btrfs_stripe_extent);
> +
> +       for (i = 0; i < btrfs_num_raid_stripes(item_size); i++) {

The loop variable could be declared here in the for expression, as it's
not used anywhere outside it.
> +               u64 devid;
> +               u64 phys;
> +
> +               devid = btrfs_raid_stride_devid(leaf, &extent->strides[i]);
> +               btrfs_set_stack_raid_stride_devid(&new->strides[i], devid);
> +
> +               phys = btrfs_raid_stride_physical(leaf, &extent->strides[i]);
> +               phys += frontpad;
> +               btrfs_set_stack_raid_stride_physical(&new->strides[i], phys);
> +       }
> +
> +       ret = btrfs_del_item(trans, stripe_root, path);
> +       if (ret)
> +               goto out;
> +
> +       btrfs_release_path(path);
> +       ret = btrfs_insert_item(trans, stripe_root, &newkey, new, item_size);

So instead of doing a deletion followed by an insertion, which implies
two searches in the btree and occasional node/leaf merges and splits,
can't we do this in a single search? By adjusting item keys, updating
items and duplicating them (followed by updating them), similar to what
we do at btrfs_drop_extents() for example. Otherwise this may result in
very high lock contention and extra work.

It's ok for an initial implementation and can be improved later, but I
was just curious if there's any reason besides simplicity for now.

Thanks.
> +
> + out:
> +       kfree(new);
> +       return ret;
> +}
> +
>  int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 length)
>  {
>         struct btrfs_fs_info *fs_info = trans->fs_info;
> @@ -43,9 +91,8 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
>                         break;
>                 if (ret > 0) {
>                         ret = 0;
> -                       if (path->slots[0] == 0)
> -                               break;
> -                       path->slots[0]--;
> +                       if (path->slots[0] > 0)
> +                               path->slots[0]--;
>                 }
>
>                 leaf = path->nodes[0];
> @@ -61,7 +108,37 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
>                 trace_btrfs_raid_extent_delete(fs_info, start, end,
>                                                found_start, found_end);
>
> -               ASSERT(found_start >= start && found_end <= end);
> +               /*
> +                * The stripe extent starts before the range we want to delete:
> +                *
> +                * |--- RAID Stripe Extent ---|
> +                * |--- keep ---|--- drop ---|
> +                *
> +                * This means we have to duplicate the tree item, truncate the
> +                * length to the new size and then re-insert the item.
> +                */
> +               if (found_start < start) {
> +                       ret = btrfs_partially_delete_raid_extent(trans, path, &key,
> +                                                                start - found_start, 0);
> +                       break;
> +               }
> +
> +               /*
> +                * The stripe extent ends after the range we want to delete:
> +                *
> +                * |--- RAID Stripe Extent ---|
> +                * |--- drop ---|--- keep ---|
> +                *
> +                * This means we have to duplicate the tree item, truncate the
> +                * length to the new size and then re-insert the item.
> +                */
> +               if (found_end > end) {
> +                       u64 diff = found_end - end;
> +
> +                       ret = btrfs_partially_delete_raid_extent(trans, path, &key,
> +                                                                diff, diff);
> +                       break;
> +               }
> +
>                 ret = btrfs_del_item(trans, stripe_root, path);
>                 if (ret)
>                         break;
> --
> 2.43.0
On 09.10.24 18:42, Filipe Manana wrote:
>>
>> +static int btrfs_partially_delete_raid_extent(struct btrfs_trans_handle *trans,
>> +                                             struct btrfs_path *path,
>> +                                             struct btrfs_key *oldkey,
>> +                                             u64 newlen, u64 frontpad)
>> +{
>> +       struct btrfs_root *stripe_root = trans->fs_info->stripe_root;
>> +       struct btrfs_stripe_extent *extent, *new;
>> +       struct extent_buffer *leaf = path->nodes[0];
>> +       int slot = path->slots[0];
>> +       const size_t item_size = btrfs_item_size(leaf, slot);
>> +       struct btrfs_key newkey;
>> +       int ret;
>> +       int i;
>> +
>> +       new = kzalloc(item_size, GFP_NOFS);
>> +       if (!new)
>> +               return -ENOMEM;
>> +
>> +       memcpy(&newkey, oldkey, sizeof(struct btrfs_key));
>> +       newkey.objectid += frontpad;
>> +       newkey.offset -= newlen;
>> +
>> +       extent = btrfs_item_ptr(leaf, slot, struct btrfs_stripe_extent);
>> +
>> +       for (i = 0; i < btrfs_num_raid_stripes(item_size); i++) {
>
> The loop variable could be declared here in the for expression, as
> it's not used anywhere outside it.

yup will fix that up.

>> +               u64 devid;
>> +               u64 phys;
>> +
>> +               devid = btrfs_raid_stride_devid(leaf, &extent->strides[i]);
>> +               btrfs_set_stack_raid_stride_devid(&new->strides[i], devid);
>> +
>> +               phys = btrfs_raid_stride_physical(leaf, &extent->strides[i]);
>> +               phys += frontpad;
>> +               btrfs_set_stack_raid_stride_physical(&new->strides[i], phys);
>> +       }
>> +
>> +       ret = btrfs_del_item(trans, stripe_root, path);
>> +       if (ret)
>> +               goto out;
>> +
>> +       btrfs_release_path(path);
>> +       ret = btrfs_insert_item(trans, stripe_root, &newkey, new, item_size);
>
> So instead of doing a deletion followed by an insertion, which implies
> two searches in the btree and occasional node/leaf merges and splits,
> can't we do this in a single search?
> By adjusting item keys, updating items and duplicating them (followed
> by updating them), similar to what we do at btrfs_drop_extents() for
> example.
> Otherwise this may result in very high lock contention and extra work.
>
> It's ok for an initial implementation and can be improved later, but I
> was just curious if there's any reason besides simplicity for now.

I did have a version using btrfs_duplicate_item() and dropped it again.
But yes, sure, I can resurrect it.

But first I have to find out why both of these versions ('-' and '+')
are buggy:

-               if (path->slots[0] == 0)
-                       break;
-               path->slots[0]--;
+               if (path->slots[0] > 0)
+                       path->slots[0]--;

The '-' version passes xfstests but not the selftest (as it's the 1st
item in the tree, it doesn't find it and bails out), the '+' version
passes the selftest but gives FS data corruption on xfstests, because
it deletes the wrong data.