Message ID | 20241017090411.25336-2-jth@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | implement truncation for RAID stripe-extents | expand |
On Thu, Oct 17, 2024 at 10:04 AM Johannes Thumshirn <jth@kernel.org> wrote: > > From: Johannes Thumshirn <johannes.thumshirn@wdc.com> > > In our CI system, the RAID stripe tree configuration sometimes fails with > the following ASSERT(): > > assertion failed: found_start >= start && found_end <= end, in fs/btrfs/raid-stripe-tree.c:64 > > This ASSERT()ion triggers, because for the initial design of RAID > stripe-tree, I had the "one ordered-extent equals one bio" rule of zoned > btrfs in mind. > > But for a RAID stripe-tree based system, that is not hosted on a zoned > storage device, but on a regular device this rule doesn't apply. > > So in case the range we want to delete starts in the middle of the > previous item, grab the item and "truncate" it's length. That is, clone > the item, subtract the deleted portion from the key's offset, delete the > old item and insert the new one. > > In case the range to delete ends in the middle of an item, we have to > adjust both the item's key as well as the stripe extents and then > re-insert the modified clone into the tree after deleting the old stripe > extent. > > Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> > --- > fs/btrfs/ctree.c | 1 + > fs/btrfs/raid-stripe-tree.c | 94 ++++++++++++++++++++++++++++++++++--- > 2 files changed, 88 insertions(+), 7 deletions(-) > > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c > index b11ec86102e3..3f320f6e7767 100644 > --- a/fs/btrfs/ctree.c > +++ b/fs/btrfs/ctree.c > @@ -3863,6 +3863,7 @@ static noinline int setup_leaf_for_split(struct btrfs_trans_handle *trans, > btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); > > BUG_ON(key.type != BTRFS_EXTENT_DATA_KEY && > + key.type != BTRFS_RAID_STRIPE_KEY && > key.type != BTRFS_EXTENT_CSUM_KEY); > > if (btrfs_leaf_free_space(leaf) >= ins_len) > diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c > index 41970bbdb05f..569273e42d85 100644 > --- a/fs/btrfs/raid-stripe-tree.c > +++ b/fs/btrfs/raid-stripe-tree.c > @@ -13,6 +13,50 @@ > #include "volumes.h" > #include "print-tree.h" > > +static int btrfs_partially_delete_raid_extent(struct btrfs_trans_handle *trans, > + struct btrfs_path *path, > + struct btrfs_key *oldkey, oldkey can be made const. > + u64 newlen, u64 frontpad) > +{ > + struct btrfs_root *stripe_root = trans->fs_info->stripe_root; > + struct btrfs_stripe_extent *extent; > + struct extent_buffer *leaf; > + int slot; > + size_t item_size; > + int ret; > + struct btrfs_key newkey = { > + .objectid = oldkey->objectid + frontpad, > + .type = BTRFS_RAID_STRIPE_KEY, > + .offset = newlen, > + }; > + > + ASSERT(oldkey->type == BTRFS_RAID_STRIPE_KEY); > + ret = btrfs_duplicate_item(trans, stripe_root, path, &newkey); > + if (ret) > + return ret; > + > + leaf = path->nodes[0]; > + slot = path->slots[0]; > + item_size = btrfs_item_size(leaf, slot); > + extent = btrfs_item_ptr(leaf, slot, struct btrfs_stripe_extent); > + > + for (int i = 0; i < btrfs_num_raid_stripes(item_size); i++) { > + struct btrfs_raid_stride *stride = &extent->strides[i]; > + u64 phys; > + > + phys = btrfs_raid_stride_physical(leaf, stride); > + btrfs_set_raid_stride_physical(leaf, stride, phys + frontpad); > + } > + > + btrfs_mark_buffer_dirty(trans, leaf); This is redundant, it was already done by btrfs_duplicate_item(), by the btrfs_search_slot() call in the caller and done by btrfs_del_item() below as well. > + > + /* delete the old item, after we've inserted a new one. */ > + path->slots[0]--; > + ret = btrfs_del_item(trans, stripe_root, path); So actually looking at this, we don't need btrfs_duplicate_item() plus btrfs_del_item(), this can be more lightweight and simpler by doing just: 1) Do the for loop as it is. 2) Then after, or before the for loop, the order doesn't really matter, just do: btrfs_set_item_key_safe(trans, path, &newkey). Less code and it avoids adding a new item and deleting another one, with the shiftings of data in the leaf, etc. Thanks. > + > + return ret; > +} > + > int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 length) > { > struct btrfs_fs_info *fs_info = trans->fs_info; > @@ -36,23 +80,24 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le > while (1) { > key.objectid = start; > key.type = BTRFS_RAID_STRIPE_KEY; > - key.offset = length; > + key.offset = 0; > > ret = btrfs_search_slot(trans, stripe_root, &key, path, -1, 1); > if (ret < 0) > break; > - if (ret > 0) { > - ret = 0; > - if (path->slots[0] == 0) > - break; > + > + if (path->slots[0] == btrfs_header_nritems(path->nodes[0])) > path->slots[0]--; > - } > > leaf = path->nodes[0]; > slot = path->slots[0]; > btrfs_item_key_to_cpu(leaf, &key, slot); > found_start = key.objectid; > found_end = found_start + key.offset; > + ret = 0; > + > + if (key.type != BTRFS_RAID_STRIPE_KEY) > + break; > > /* That stripe ends before we start, we're done. */ > if (found_end <= start) > @@ -61,7 +106,42 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le > trace_btrfs_raid_extent_delete(fs_info, start, end, > found_start, found_end); > > - ASSERT(found_start >= start && found_end <= end); > + /* > + * The stripe extent starts before the range we want to delete: > + * > + * |--- RAID Stripe Extent ---| > + * |--- keep ---|--- drop ---| > + * > + * This means we have to duplicate the tree item, truncate the > + * length to the new size and then re-insert the item. > + */ > + if (found_start < start) { > + u64 diff = start - found_start; > + > + ret = btrfs_partially_delete_raid_extent(trans, path, > + &key, > + diff, 0); > + break; > + } > + > + /* > + * The stripe extent ends after the range we want to delete: > + * > + * |--- RAID Stripe Extent ---| > + * |--- drop ---|--- keep ---| > + * > + * This means we have to duplicate the tree item, truncate the > + * length to the new size and then re-insert the item. > + */ > + if (found_end > end) { > + u64 diff = found_end - end; > + > + ret = btrfs_partially_delete_raid_extent(trans, path, > + &key, > + diff, diff); > + break; > + } > + > ret = btrfs_del_item(trans, stripe_root, path); > if (ret) > break; > -- > 2.43.0 > >
On 22.10.24 15:53, Filipe Manana wrote: > On Thu, Oct 17, 2024 at 10:04 AM Johannes Thumshirn <jth@kernel.org> wrote: >> >> From: Johannes Thumshirn <johannes.thumshirn@wdc.com> >> >> In our CI system, the RAID stripe tree configuration sometimes fails with >> the following ASSERT(): >> >> assertion failed: found_start >= start && found_end <= end, in fs/btrfs/raid-stripe-tree.c:64 >> >> This ASSERT()ion triggers, because for the initial design of RAID >> stripe-tree, I had the "one ordered-extent equals one bio" rule of zoned >> btrfs in mind. >> >> But for a RAID stripe-tree based system, that is not hosted on a zoned >> storage device, but on a regular device this rule doesn't apply. >> >> So in case the range we want to delete starts in the middle of the >> previous item, grab the item and "truncate" it's length. That is, clone >> the item, subtract the deleted portion from the key's offset, delete the >> old item and insert the new one. >> >> In case the range to delete ends in the middle of an item, we have to >> adjust both the item's key as well as the stripe extents and then >> re-insert the modified clone into the tree after deleting the old stripe >> extent. >> >> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> >> --- >> fs/btrfs/ctree.c | 1 + >> fs/btrfs/raid-stripe-tree.c | 94 ++++++++++++++++++++++++++++++++++--- >> 2 files changed, 88 insertions(+), 7 deletions(-) >> >> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c >> index b11ec86102e3..3f320f6e7767 100644 >> --- a/fs/btrfs/ctree.c >> +++ b/fs/btrfs/ctree.c >> @@ -3863,6 +3863,7 @@ static noinline int setup_leaf_for_split(struct btrfs_trans_handle *trans, >> btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); >> >> BUG_ON(key.type != BTRFS_EXTENT_DATA_KEY && >> + key.type != BTRFS_RAID_STRIPE_KEY && >> key.type != BTRFS_EXTENT_CSUM_KEY); >> >> if (btrfs_leaf_free_space(leaf) >= ins_len) >> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c >> index 41970bbdb05f..569273e42d85 100644 >> --- a/fs/btrfs/raid-stripe-tree.c >> +++ b/fs/btrfs/raid-stripe-tree.c >> @@ -13,6 +13,50 @@ >> #include "volumes.h" >> #include "print-tree.h" >> >> +static int btrfs_partially_delete_raid_extent(struct btrfs_trans_handle *trans, >> + struct btrfs_path *path, >> + struct btrfs_key *oldkey, > > oldkey can be made const. > >> + u64 newlen, u64 frontpad) >> +{ >> + struct btrfs_root *stripe_root = trans->fs_info->stripe_root; >> + struct btrfs_stripe_extent *extent; >> + struct extent_buffer *leaf; >> + int slot; >> + size_t item_size; >> + int ret; >> + struct btrfs_key newkey = { >> + .objectid = oldkey->objectid + frontpad, >> + .type = BTRFS_RAID_STRIPE_KEY, >> + .offset = newlen, >> + }; >> + >> + ASSERT(oldkey->type == BTRFS_RAID_STRIPE_KEY); >> + ret = btrfs_duplicate_item(trans, stripe_root, path, &newkey); >> + if (ret) >> + return ret; >> + >> + leaf = path->nodes[0]; >> + slot = path->slots[0]; >> + item_size = btrfs_item_size(leaf, slot); >> + extent = btrfs_item_ptr(leaf, slot, struct btrfs_stripe_extent); >> + >> + for (int i = 0; i < btrfs_num_raid_stripes(item_size); i++) { >> + struct btrfs_raid_stride *stride = &extent->strides[i]; >> + u64 phys; >> + >> + phys = btrfs_raid_stride_physical(leaf, stride); >> + btrfs_set_raid_stride_physical(leaf, stride, phys + frontpad); >> + } >> + >> + btrfs_mark_buffer_dirty(trans, leaf); > > This is redundant, it was already done by btrfs_duplicate_item(), by > the btrfs_search_slot() call in the caller and done by > btrfs_del_item() below as well. > > >> + >> + /* delete the old item, after we've inserted a new one. */ >> + path->slots[0]--; >> + ret = btrfs_del_item(trans, stripe_root, path); > > So actually looking at this, we don't need btrfs_duplicate_item() > plus btrfs_del_item(), this can be more lightweight and simpler by > doing just: > > 1) Do the for loop as it is. > > 2) Then after, or before the for loop, the order doesn't really > matter, just do: btrfs_set_item_key_safe(trans, path, &newkey). > > Less code and it avoids adding a new item and deleting another one, > with the shiftings of data in the leaf, etc. Oh I didn't know about btrfs_set_item_key_safe(), that sounds like a good plan thanks :) Can I still get rid of btrfs_mark_buffer_dirty then?
On Tue, Oct 22, 2024 at 4:37 PM Johannes Thumshirn <Johannes.Thumshirn@wdc.com> wrote: > > On 22.10.24 15:53, Filipe Manana wrote: > > On Thu, Oct 17, 2024 at 10:04 AM Johannes Thumshirn <jth@kernel.org> wrote: > >> > >> From: Johannes Thumshirn <johannes.thumshirn@wdc.com> > >> > >> In our CI system, the RAID stripe tree configuration sometimes fails with > >> the following ASSERT(): > >> > >> assertion failed: found_start >= start && found_end <= end, in fs/btrfs/raid-stripe-tree.c:64 > >> > >> This ASSERT()ion triggers, because for the initial design of RAID > >> stripe-tree, I had the "one ordered-extent equals one bio" rule of zoned > >> btrfs in mind. > >> > >> But for a RAID stripe-tree based system, that is not hosted on a zoned > >> storage device, but on a regular device this rule doesn't apply. > >> > >> So in case the range we want to delete starts in the middle of the > >> previous item, grab the item and "truncate" it's length. That is, clone > >> the item, subtract the deleted portion from the key's offset, delete the > >> old item and insert the new one. > >> > >> In case the range to delete ends in the middle of an item, we have to > >> adjust both the item's key as well as the stripe extents and then > >> re-insert the modified clone into the tree after deleting the old stripe > >> extent. > >> > >> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> > >> --- > >> fs/btrfs/ctree.c | 1 + > >> fs/btrfs/raid-stripe-tree.c | 94 ++++++++++++++++++++++++++++++++++--- > >> 2 files changed, 88 insertions(+), 7 deletions(-) > >> > >> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c > >> index b11ec86102e3..3f320f6e7767 100644 > >> --- a/fs/btrfs/ctree.c > >> +++ b/fs/btrfs/ctree.c > >> @@ -3863,6 +3863,7 @@ static noinline int setup_leaf_for_split(struct btrfs_trans_handle *trans, > >> btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); > >> > >> BUG_ON(key.type != BTRFS_EXTENT_DATA_KEY && > >> + key.type != BTRFS_RAID_STRIPE_KEY && > >> key.type != BTRFS_EXTENT_CSUM_KEY); > >> > >> if (btrfs_leaf_free_space(leaf) >= ins_len) > >> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c > >> index 41970bbdb05f..569273e42d85 100644 > >> --- a/fs/btrfs/raid-stripe-tree.c > >> +++ b/fs/btrfs/raid-stripe-tree.c > >> @@ -13,6 +13,50 @@ > >> #include "volumes.h" > >> #include "print-tree.h" > >> > >> +static int btrfs_partially_delete_raid_extent(struct btrfs_trans_handle *trans, > >> + struct btrfs_path *path, > >> + struct btrfs_key *oldkey, > > > > oldkey can be made const. > > > >> + u64 newlen, u64 frontpad) > >> +{ > >> + struct btrfs_root *stripe_root = trans->fs_info->stripe_root; > >> + struct btrfs_stripe_extent *extent; > >> + struct extent_buffer *leaf; > >> + int slot; > >> + size_t item_size; > >> + int ret; > >> + struct btrfs_key newkey = { > >> + .objectid = oldkey->objectid + frontpad, > >> + .type = BTRFS_RAID_STRIPE_KEY, > >> + .offset = newlen, > >> + }; > >> + > >> + ASSERT(oldkey->type == BTRFS_RAID_STRIPE_KEY); > >> + ret = btrfs_duplicate_item(trans, stripe_root, path, &newkey); > >> + if (ret) > >> + return ret; > >> + > >> + leaf = path->nodes[0]; > >> + slot = path->slots[0]; > >> + item_size = btrfs_item_size(leaf, slot); > >> + extent = btrfs_item_ptr(leaf, slot, struct btrfs_stripe_extent); > >> + > >> + for (int i = 0; i < btrfs_num_raid_stripes(item_size); i++) { > >> + struct btrfs_raid_stride *stride = &extent->strides[i]; > >> + u64 phys; > >> + > >> + phys = btrfs_raid_stride_physical(leaf, stride); > >> + btrfs_set_raid_stride_physical(leaf, stride, phys + frontpad); > >> + } > >> + > >> + btrfs_mark_buffer_dirty(trans, leaf); > > > > This is redundant, it was already done by btrfs_duplicate_item(), by > > the btrfs_search_slot() call in the caller and done by > > btrfs_del_item() below as well. > > > > > >> + > >> + /* delete the old item, after we've inserted a new one. */ > >> + path->slots[0]--; > >> + ret = btrfs_del_item(trans, stripe_root, path); > > > > So actually looking at this, we don't need btrfs_duplicate_item() > > plus btrfs_del_item(), this can be more lightweight and simpler by > > doing just: > > > > 1) Do the for loop as it is. > > > > 2) Then after, or before the for loop, the order doesn't really > > matter, just do: btrfs_set_item_key_safe(trans, path, &newkey). > > > > Less code and it avoids adding a new item and deleting another one, > > with the shiftings of data in the leaf, etc. > > Oh I didn't know about btrfs_set_item_key_safe(), that sounds like a > good plan thanks :) > Can I still get rid of btrfs_mark_buffer_dirty then? Yes, even because btrfs_set_item_key_safe() already does it.
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index b11ec86102e3..3f320f6e7767 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -3863,6 +3863,7 @@ static noinline int setup_leaf_for_split(struct btrfs_trans_handle *trans, btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); BUG_ON(key.type != BTRFS_EXTENT_DATA_KEY && + key.type != BTRFS_RAID_STRIPE_KEY && key.type != BTRFS_EXTENT_CSUM_KEY); if (btrfs_leaf_free_space(leaf) >= ins_len) diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c index 41970bbdb05f..569273e42d85 100644 --- a/fs/btrfs/raid-stripe-tree.c +++ b/fs/btrfs/raid-stripe-tree.c @@ -13,6 +13,50 @@ #include "volumes.h" #include "print-tree.h" +static int btrfs_partially_delete_raid_extent(struct btrfs_trans_handle *trans, + struct btrfs_path *path, + struct btrfs_key *oldkey, + u64 newlen, u64 frontpad) +{ + struct btrfs_root *stripe_root = trans->fs_info->stripe_root; + struct btrfs_stripe_extent *extent; + struct extent_buffer *leaf; + int slot; + size_t item_size; + int ret; + struct btrfs_key newkey = { + .objectid = oldkey->objectid + frontpad, + .type = BTRFS_RAID_STRIPE_KEY, + .offset = newlen, + }; + + ASSERT(oldkey->type == BTRFS_RAID_STRIPE_KEY); + ret = btrfs_duplicate_item(trans, stripe_root, path, &newkey); + if (ret) + return ret; + + leaf = path->nodes[0]; + slot = path->slots[0]; + item_size = btrfs_item_size(leaf, slot); + extent = btrfs_item_ptr(leaf, slot, struct btrfs_stripe_extent); + + for (int i = 0; i < btrfs_num_raid_stripes(item_size); i++) { + struct btrfs_raid_stride *stride = &extent->strides[i]; + u64 phys; + + phys = btrfs_raid_stride_physical(leaf, stride); + btrfs_set_raid_stride_physical(leaf, stride, phys + frontpad); + } + + btrfs_mark_buffer_dirty(trans, leaf); + + /* delete the old item, after we've inserted a new one. */ + path->slots[0]--; + ret = btrfs_del_item(trans, stripe_root, path); + + return ret; +} + int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 length) { struct btrfs_fs_info *fs_info = trans->fs_info; @@ -36,23 +80,24 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le while (1) { key.objectid = start; key.type = BTRFS_RAID_STRIPE_KEY; - key.offset = length; + key.offset = 0; ret = btrfs_search_slot(trans, stripe_root, &key, path, -1, 1); if (ret < 0) break; - if (ret > 0) { - ret = 0; - if (path->slots[0] == 0) - break; + + if (path->slots[0] == btrfs_header_nritems(path->nodes[0])) path->slots[0]--; - } leaf = path->nodes[0]; slot = path->slots[0]; btrfs_item_key_to_cpu(leaf, &key, slot); found_start = key.objectid; found_end = found_start + key.offset; + ret = 0; + + if (key.type != BTRFS_RAID_STRIPE_KEY) + break; /* That stripe ends before we start, we're done. */ if (found_end <= start) @@ -61,7 +106,42 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le trace_btrfs_raid_extent_delete(fs_info, start, end, found_start, found_end); - ASSERT(found_start >= start && found_end <= end); + /* + * The stripe extent starts before the range we want to delete: + * + * |--- RAID Stripe Extent ---| + * |--- keep ---|--- drop ---| + * + * This means we have to duplicate the tree item, truncate the + * length to the new size and then re-insert the item. + */ + if (found_start < start) { + u64 diff = start - found_start; + + ret = btrfs_partially_delete_raid_extent(trans, path, + &key, + diff, 0); + break; + } + + /* + * The stripe extent ends after the range we want to delete: + * + * |--- RAID Stripe Extent ---| + * |--- drop ---|--- keep ---| + * + * This means we have to duplicate the tree item, truncate the + * length to the new size and then re-insert the item. + */ + if (found_end > end) { + u64 diff = found_end - end; + + ret = btrfs_partially_delete_raid_extent(trans, path, + &key, + diff, diff); + break; + } + ret = btrfs_del_item(trans, stripe_root, path); if (ret) break;