Message ID: 72946446ca8eda4477e0a11ea0b9d15cb05aa1e1.1731310741.git.naohiro.aota@wdc.com (mailing list archive)
State: New
Series: btrfs: zoned: implement ZONE_RESET space_info reclaiming
On 11.11.24 08:49, Naohiro Aota wrote:
> +
> +/*
> + * Reset the zones of unused block groups to free up @space_info->bytes_zone_unusable.
> + *
> + * @space_info: the space to work on
> + * @num_bytes: targeting reclaim bytes
> + */

Maybe add a comment what's the difference between this and
btrfs_delete_unused_bgs()?

> +int btrfs_reset_unused_block_groups(struct btrfs_space_info *space_info, u64 num_bytes)

Maybe call it btrfs_space_info_reset_unused_zones()?

> +{
> +    struct btrfs_fs_info *fs_info = space_info->fs_info;
> +
> +    if (!btrfs_is_zoned(fs_info))
> +        return 0;
> +
> +    while (num_bytes > 0) {
> +        struct btrfs_chunk_map *map;
> +        struct btrfs_block_group *bg = NULL;
> +        bool found = false;
> +        u64 reclaimed = 0;
> +
> +        /*
> +         * Here, we choose a fully zone_unusable block group. It's
> +         * technically possible to reset a partly zone_unusable block
> +         * group, which still has some free space left. However,
> +         * handling that needs to cope with the allocation side, which
> +         * makes the logic more complex. So, let's handle the easy case
> +         * for now.
> +         */
> +        scoped_guard(spinlock, &fs_info->unused_bgs_lock) {

Again, not a fan of the scoped_guard() macro...

> +            list_for_each_entry(bg, &fs_info->unused_bgs, bg_list) {
> +                if ((bg->flags & BTRFS_BLOCK_GROUP_TYPE_MASK) != space_info->flags)
> +                    continue;
> +
> +                if (!spin_trylock(&bg->lock))
> +                    continue;

especially as normal spin_lock() calls are mixed in.

> +                if (btrfs_is_block_group_used(bg) ||
> +                    bg->zone_unusable < bg->length) {
> +                    spin_unlock(&bg->lock);
> +                    continue;
> +                }
> +                spin_unlock(&bg->lock);
> +                found = true;
> +                break;
> +            }
> +            if (!found)
> +                return 0;
> +
> +            list_del_init(&bg->bg_list);
> +            btrfs_put_block_group(bg);
> +        }
> +
> +        /*
> +         * Since the block group is fully zone_unusable and we cannot
> +         * allocate anymore from this block group, we don't need to set
> +         * this block group read-only.
> +         */
> +
> +        scoped_guard(rwsem_read, &fs_info->dev_replace.rwsem) {
> +            const sector_t zone_size_sectors = fs_info->zone_size >> SECTOR_SHIFT;
> +
> +            map = bg->physical_map;
> +            for (int i = 0; i < map->num_stripes; i++) {
> +                struct btrfs_io_stripe *stripe = &map->stripes[i];
> +                unsigned int nofs_flags;
> +                int ret;
> +
> +                nofs_flags = memalloc_nofs_save();
> +                ret = blkdev_zone_mgmt(stripe->dev->bdev, REQ_OP_ZONE_RESET,
> +                                       stripe->physical >> SECTOR_SHIFT,
> +                                       zone_size_sectors);
> +                memalloc_nofs_restore(nofs_flags);
> +
> +                if (ret)
> +                    return ret;
> +            }
> +        }
> +
> +        scoped_guard(spinlock, &space_info->lock) {
> +            scoped_guard(spinlock, &bg->lock) {
> +                ASSERT(!btrfs_is_block_group_used(bg));
> +                if (bg->ro)
> +                    continue;
> +
> +                reclaimed = bg->alloc_offset;
> +                bg->zone_unusable = bg->length - bg->zone_capacity;
> +                bg->alloc_offset = 0;
> +                /*
> +                 * This holds because we currently reset fully
> +                 * used then freed BG.
> +                 */
> +                ASSERT(reclaimed == bg->zone_capacity);
> +                bg->free_space_ctl->free_space += reclaimed;
> +                space_info->bytes_zone_unusable -= reclaimed;

	btrfs_space_info_update_bytes_zone_unusable(space_info, -reclaimed);

Which perfectly fits once we get rid of the two scoped_guard() macros.

> +            }
> +            btrfs_return_free_space(space_info, reclaimed);
> +        }
> +
> +        if (num_bytes <= reclaimed)
> +            break;
> +        num_bytes -= reclaimed;
> +    }
> +
> +    return 0;
> +}
On Mon, Nov 11, 2024 at 09:04:39AM +0000, Johannes Thumshirn wrote:
> > +        /*
> > +         * Here, we choose a fully zone_unusable block group. It's
> > +         * technically possible to reset a partly zone_unusable block
> > +         * group, which still has some free space left. However,
> > +         * handling that needs to cope with the allocation side, which
> > +         * makes the logic more complex. So, let's handle the easy case
> > +         * for now.
> > +         */
> > +        scoped_guard(spinlock, &fs_info->unused_bgs_lock) {
>
> Again, not a fan of the scoped_guard() macro...

Yeah, we may use the scoped locking for something in the future but so
far it would be quite confusing to mix explicit locking with scoped for
current data structures. Subjectively speaking I don't want to use that
at all.
On Mon, Nov 11, 2024 at 09:04:39AM GMT, Johannes Thumshirn wrote:
> On 11.11.24 08:49, Naohiro Aota wrote:
> > +
> > +/*
> > + * Reset the zones of unused block groups to free up @space_info->bytes_zone_unusable.
> > + *
> > + * @space_info: the space to work on
> > + * @num_bytes: targeting reclaim bytes
> > + */
>
> Maybe add a comment what's the difference between this and
> btrfs_delete_unused_bgs()?

OK. This function allows reusing a block group's space without removing
the block group. So, it is like freeing a region in the non-zoned mode.
It is faster than btrfs_delete_unused_bgs(), which just removes a block
group and requires allocating a new one if we want to reuse the zone
region.

>
> > +int btrfs_reset_unused_block_groups(struct btrfs_space_info *space_info, u64 num_bytes)
>
> Maybe call it btrfs_space_info_reset_unused_zones()?

Hmm, currently we only call this function from the space_info reclaim
context. But, like btrfs_run_delayed_refs() or
btrfs_commit_current_transaction(), we may be able to use it from other
contexts as well (e.g., extent allocation may want this?). So, I'm not
sure sticking with "space_info" in the name is a good option.

>
> > +{
> > +    struct btrfs_fs_info *fs_info = space_info->fs_info;
> > +
> > +    if (!btrfs_is_zoned(fs_info))
> > +        return 0;
> > +
> > +    while (num_bytes > 0) {
> > +        struct btrfs_chunk_map *map;
> > +        struct btrfs_block_group *bg = NULL;
> > +        bool found = false;
> > +        u64 reclaimed = 0;
> > +
> > +        /*
> > +         * Here, we choose a fully zone_unusable block group. It's
> > +         * technically possible to reset a partly zone_unusable block
> > +         * group, which still has some free space left. However,
> > +         * handling that needs to cope with the allocation side, which
> > +         * makes the logic more complex. So, let's handle the easy case
> > +         * for now.
> > +         */
> > +        scoped_guard(spinlock, &fs_info->unused_bgs_lock) {
>
> Again, not a fan of the scoped_guard() macro...
>
> > +            list_for_each_entry(bg, &fs_info->unused_bgs, bg_list) {
> > +                if ((bg->flags & BTRFS_BLOCK_GROUP_TYPE_MASK) != space_info->flags)
> > +                    continue;
> > +
> > +                if (!spin_trylock(&bg->lock))
> > +                    continue;
>
> especially as normal spin_lock() calls are mixed in.

Yeah, this is nasty... there is a scoped_cond_guard() variant which can
use spin_trylock(). But, since it uses "for (...)" internally, using it
would break the "continue" below. So, I avoided using it here.

>
> > +                if (btrfs_is_block_group_used(bg) ||
> > +                    bg->zone_unusable < bg->length) {
> > +                    spin_unlock(&bg->lock);
> > +                    continue;
> > +                }
> > +                spin_unlock(&bg->lock);
> > +                found = true;
> > +                break;
> > +            }
> > +            if (!found)
> > +                return 0;
> > +
> > +            list_del_init(&bg->bg_list);
> > +            btrfs_put_block_group(bg);
> > +        }
> > +
> > +        /*
> > +         * Since the block group is fully zone_unusable and we cannot
> > +         * allocate anymore from this block group, we don't need to set
> > +         * this block group read-only.
> > +         */
> > +
> > +        scoped_guard(rwsem_read, &fs_info->dev_replace.rwsem) {
> > +            const sector_t zone_size_sectors = fs_info->zone_size >> SECTOR_SHIFT;
> > +
> > +            map = bg->physical_map;
> > +            for (int i = 0; i < map->num_stripes; i++) {
> > +                struct btrfs_io_stripe *stripe = &map->stripes[i];
> > +                unsigned int nofs_flags;
> > +                int ret;
> > +
> > +                nofs_flags = memalloc_nofs_save();
> > +                ret = blkdev_zone_mgmt(stripe->dev->bdev, REQ_OP_ZONE_RESET,
> > +                                       stripe->physical >> SECTOR_SHIFT,
> > +                                       zone_size_sectors);
> > +                memalloc_nofs_restore(nofs_flags);
> > +
> > +                if (ret)
> > +                    return ret;
> > +            }
> > +        }
> > +
> > +        scoped_guard(spinlock, &space_info->lock) {
> > +            scoped_guard(spinlock, &bg->lock) {
> > +                ASSERT(!btrfs_is_block_group_used(bg));
> > +                if (bg->ro)
> > +                    continue;
> > +
> > +                reclaimed = bg->alloc_offset;
> > +                bg->zone_unusable = bg->length - bg->zone_capacity;
> > +                bg->alloc_offset = 0;
> > +                /*
> > +                 * This holds because we currently reset fully
> > +                 * used then freed BG.
> > +                 */
> > +                ASSERT(reclaimed == bg->zone_capacity);
> > +                bg->free_space_ctl->free_space += reclaimed;
> > +                space_info->bytes_zone_unusable -= reclaimed;
>
> 	btrfs_space_info_update_bytes_zone_unusable(space_info,
> 						    -reclaimed);
>
> Which perfectly fits once we get rid of the two scoped_guard() macros.
>
> > +            }
> > +            btrfs_return_free_space(space_info, reclaimed);
> > +        }
> > +
> > +        if (num_bytes <= reclaimed)
> > +            break;
> > +        num_bytes -= reclaimed;
> > +    }
> > +
> > +    return 0;
> > +}
>
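For reference, a rough sketch of how that final accounting block could look once the two scoped guards are replaced by explicit locking and the helper Johannes suggested is used. This is untested and only illustrative; it assumes the two-argument form of btrfs_space_info_update_bytes_zone_unusable() quoted in the review, and it simply skips the accounting (leaving reclaimed at 0) for a read-only block group:

		spin_lock(&space_info->lock);
		spin_lock(&bg->lock);
		ASSERT(!btrfs_is_block_group_used(bg));
		/* A read-only block group keeps reclaimed == 0. */
		if (!bg->ro) {
			reclaimed = bg->alloc_offset;
			bg->zone_unusable = bg->length - bg->zone_capacity;
			bg->alloc_offset = 0;
			/*
			 * This holds because only fully used then freed
			 * block groups are reset here.
			 */
			ASSERT(reclaimed == bg->zone_capacity);
			bg->free_space_ctl->free_space += reclaimed;
			/* Helper suggested in the review; signature assumed. */
			btrfs_space_info_update_bytes_zone_unusable(space_info, -reclaimed);
		}
		spin_unlock(&bg->lock);
		btrfs_return_free_space(space_info, reclaimed);
		spin_unlock(&space_info->lock);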
On Mon, Nov 11, 2024 at 03:12:34PM GMT, David Sterba wrote:
> On Mon, Nov 11, 2024 at 09:04:39AM +0000, Johannes Thumshirn wrote:
> > > +        /*
> > > +         * Here, we choose a fully zone_unusable block group. It's
> > > +         * technically possible to reset a partly zone_unusable block
> > > +         * group, which still has some free space left. However,
> > > +         * handling that needs to cope with the allocation side, which
> > > +         * makes the logic more complex. So, let's handle the easy case
> > > +         * for now.
> > > +         */
> > > +        scoped_guard(spinlock, &fs_info->unused_bgs_lock) {
> >
> > Again, not a fan of the scoped_guard() macro...
>
> Yeah, we may use the scoped locking for something in the future but so
> far it would be quite confusing to mix explicit locking with scoped for
> current data structures. Subjectively speaking I don't want to use that
> at all.

As I wrote in another reply, the mixing happens because of the use of
"continue" in the outer loop. Well, I'll rewrite this with the
conventional locking style.
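One possible shape for that rewrite, sketched purely from the code quoted in this thread and untested: the trylock/continue pattern stays inside list_for_each_entry(), and the early-return path now has to drop unused_bgs_lock explicitly.

	spin_lock(&fs_info->unused_bgs_lock);
	list_for_each_entry(bg, &fs_info->unused_bgs, bg_list) {
		if ((bg->flags & BTRFS_BLOCK_GROUP_TYPE_MASK) != space_info->flags)
			continue;

		/* Skip any block group we cannot lock without blocking. */
		if (!spin_trylock(&bg->lock))
			continue;
		if (btrfs_is_block_group_used(bg) || bg->zone_unusable < bg->length) {
			spin_unlock(&bg->lock);
			continue;
		}
		spin_unlock(&bg->lock);
		found = true;
		break;
	}
	if (!found) {
		spin_unlock(&fs_info->unused_bgs_lock);
		return 0;
	}

	list_del_init(&bg->bg_list);
	btrfs_put_block_group(bg);
	spin_unlock(&fs_info->unused_bgs_lock);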
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index b94cd43d9495..8a0a347652f1 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -14,6 +14,7 @@
 #include "fs.h"
 #include "accessors.h"
 #include "extent-tree.h"
+#include "zoned.h"
 
 /*
  * HOW DOES SPACE RESERVATION WORK
@@ -127,6 +128,14 @@
  *   churn a lot and we can avoid making some extent tree modifications if we
  *   are able to delay for as long as possible.
  *
+ * RESET_ZONES
+ *   This state works only for the zoned mode. On the zoned mode, we cannot
+ *   reuse once allocated then freed region until we reset the zone, due to
+ *   the sequential write zone requirement. The RESET_ZONES state resets the
+ *   zones of an unused block group and let btrfs reuse the space. The reusing
+ *   is faster than removing the block group and allocating another block
+ *   group on the zones.
+ *
  * ALLOC_CHUNK
  *   We will skip this the first time through space reservation, because of
  *   overcommit and we don't want to have a lot of useless metadata space when
@@ -832,6 +841,9 @@ static void flush_space(struct btrfs_fs_info *fs_info,
         */
        ret = btrfs_commit_current_transaction(root);
        break;
+   case RESET_ZONES:
+       ret = btrfs_reset_unused_block_groups(space_info, num_bytes);
+       break;
    default:
        ret = -ENOSPC;
        break;
@@ -1084,9 +1096,14 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
    enum btrfs_flush_state flush_state;
    int commit_cycles = 0;
    u64 last_tickets_id;
+   enum btrfs_flush_state final_state;
 
    fs_info = container_of(work, struct btrfs_fs_info, async_reclaim_work);
    space_info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
+   if (btrfs_is_zoned(fs_info))
+       final_state = RESET_ZONES;
+   else
+       final_state = COMMIT_TRANS;
 
    spin_lock(&space_info->lock);
    to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info);
@@ -1139,7 +1156,7 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
        if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles)
            flush_state++;
 
-       if (flush_state > COMMIT_TRANS) {
+       if (flush_state > final_state) {
            commit_cycles++;
            if (commit_cycles > 2) {
                if (maybe_fail_all_tickets(fs_info, space_info)) {
@@ -1153,7 +1170,7 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
            }
        }
        spin_unlock(&space_info->lock);
-   } while (flush_state <= COMMIT_TRANS);
+   } while (flush_state <= final_state);
 }
 
 /*
@@ -1284,6 +1301,10 @@ static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
  *   This is where we reclaim all of the pinned space generated by running the
  *   iputs
  *
+ * RESET_ZONES
+ *   This state works only for the zoned mode. We scan the unused block
+ *   group list and reset the zones and reuse the block group.
+ *
  * ALLOC_CHUNK_FORCE
  *   For data we start with alloc chunk force, however we could have been full
  *   before, and then the transaction commit could have freed new block groups,
@@ -1293,6 +1314,7 @@ static const enum btrfs_flush_state data_flush_states[] = {
    FLUSH_DELALLOC_FULL,
    RUN_DELAYED_IPUTS,
    COMMIT_TRANS,
+   RESET_ZONES,
    ALLOC_CHUNK_FORCE,
 };
 
@@ -1384,6 +1406,7 @@ void btrfs_init_async_reclaim_work(struct btrfs_fs_info *fs_info)
 static const enum btrfs_flush_state priority_flush_states[] = {
    FLUSH_DELAYED_ITEMS_NR,
    FLUSH_DELAYED_ITEMS,
+   RESET_ZONES,
    ALLOC_CHUNK,
 };
 
@@ -1397,6 +1420,7 @@ static const enum btrfs_flush_state evict_flush_states[] = {
    FLUSH_DELALLOC_FULL,
    ALLOC_CHUNK,
    COMMIT_TRANS,
+   RESET_ZONES,
 };
 
 static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index 69071afc0d47..a96efdb5e681 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -79,6 +79,10 @@ enum btrfs_reserve_flush_enum {
    BTRFS_RESERVE_FLUSH_EMERGENCY,
 };
 
+/*
+ * Please be aware that the order of enum values will be the order of the reclaim
+ * process in btrfs_async_reclaim_metadata_space().
+ */
 enum btrfs_flush_state {
    FLUSH_DELAYED_ITEMS_NR  = 1,
    FLUSH_DELAYED_ITEMS     = 2,
@@ -91,6 +95,7 @@ enum btrfs_flush_state {
    ALLOC_CHUNK_FORCE   = 9,
    RUN_DELAYED_IPUTS   = 10,
    COMMIT_TRANS        = 11,
+   RESET_ZONES         = 12,
 };
 
 struct btrfs_space_info {
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index cb32966380f5..f939c2d7bb0b 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -2648,3 +2648,107 @@ void btrfs_check_active_zone_reservation(struct btrfs_fs_info *fs_info)
    }
    spin_unlock(&fs_info->zone_active_bgs_lock);
 }
+
+/*
+ * Reset the zones of unused block groups to free up @space_info->bytes_zone_unusable.
+ *
+ * @space_info: the space to work on
+ * @num_bytes:  targeting reclaim bytes
+ */
+int btrfs_reset_unused_block_groups(struct btrfs_space_info *space_info, u64 num_bytes)
+{
+   struct btrfs_fs_info *fs_info = space_info->fs_info;
+
+   if (!btrfs_is_zoned(fs_info))
+       return 0;
+
+   while (num_bytes > 0) {
+       struct btrfs_chunk_map *map;
+       struct btrfs_block_group *bg = NULL;
+       bool found = false;
+       u64 reclaimed = 0;
+
+       /*
+        * Here, we choose a fully zone_unusable block group. It's
+        * technically possible to reset a partly zone_unusable block
+        * group, which still has some free space left. However,
+        * handling that needs to cope with the allocation side, which
+        * makes the logic more complex. So, let's handle the easy case
+        * for now.
+        */
+       scoped_guard(spinlock, &fs_info->unused_bgs_lock) {
+           list_for_each_entry(bg, &fs_info->unused_bgs, bg_list) {
+               if ((bg->flags & BTRFS_BLOCK_GROUP_TYPE_MASK) != space_info->flags)
+                   continue;
+
+               if (!spin_trylock(&bg->lock))
+                   continue;
+               if (btrfs_is_block_group_used(bg) ||
+                   bg->zone_unusable < bg->length) {
+                   spin_unlock(&bg->lock);
+                   continue;
+               }
+               spin_unlock(&bg->lock);
+               found = true;
+               break;
+           }
+           if (!found)
+               return 0;
+
+           list_del_init(&bg->bg_list);
+           btrfs_put_block_group(bg);
+       }
+
+       /*
+        * Since the block group is fully zone_unusable and we cannot
+        * allocate anymore from this block group, we don't need to set
+        * this block group read-only.
+        */
+
+       scoped_guard(rwsem_read, &fs_info->dev_replace.rwsem) {
+           const sector_t zone_size_sectors = fs_info->zone_size >> SECTOR_SHIFT;
+
+           map = bg->physical_map;
+           for (int i = 0; i < map->num_stripes; i++) {
+               struct btrfs_io_stripe *stripe = &map->stripes[i];
+               unsigned int nofs_flags;
+               int ret;
+
+               nofs_flags = memalloc_nofs_save();
+               ret = blkdev_zone_mgmt(stripe->dev->bdev, REQ_OP_ZONE_RESET,
+                                      stripe->physical >> SECTOR_SHIFT,
+                                      zone_size_sectors);
+               memalloc_nofs_restore(nofs_flags);
+
+               if (ret)
+                   return ret;
+           }
+       }
+
+       scoped_guard(spinlock, &space_info->lock) {
+           scoped_guard(spinlock, &bg->lock) {
+               ASSERT(!btrfs_is_block_group_used(bg));
+               if (bg->ro)
+                   continue;
+
+               reclaimed = bg->alloc_offset;
+               bg->zone_unusable = bg->length - bg->zone_capacity;
+               bg->alloc_offset = 0;
+               /*
+                * This holds because we currently reset fully
+                * used then freed BG.
+                */
+               ASSERT(reclaimed == bg->zone_capacity);
+               bg->free_space_ctl->free_space += reclaimed;
+               space_info->bytes_zone_unusable -= reclaimed;
+           }
+           btrfs_return_free_space(space_info, reclaimed);
+       }
+
+       if (num_bytes <= reclaimed)
+           break;
+       num_bytes -= reclaimed;
+   }
+
+   return 0;
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 7612e6572605..9672bf4c3335 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -96,6 +96,7 @@ int btrfs_zone_finish_one_bg(struct btrfs_fs_info *fs_info);
 int btrfs_zoned_activate_one_bg(struct btrfs_fs_info *fs_info,
                struct btrfs_space_info *space_info, bool do_finish);
 void btrfs_check_active_zone_reservation(struct btrfs_fs_info *fs_info);
+int btrfs_reset_unused_block_groups(struct btrfs_space_info *space_info, u64 num_bytes);
 #else /* CONFIG_BLK_DEV_ZONED */
 
 static inline int btrfs_get_dev_zone_info_all_devices(struct btrfs_fs_info *fs_info)
@@ -265,6 +266,12 @@ static inline int btrfs_zoned_activate_one_bg(struct btrfs_fs_info *fs_info,
 
 static inline void btrfs_check_active_zone_reservation(struct btrfs_fs_info *fs_info) { }
 
+static inline int btrfs_reset_unused_block_groups(struct btrfs_space_info *space_info,
+                          u64 num_bytes)
+{
+   return 0;
+}
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 4df93ca9b7a8..aeaf079ff994 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -97,6 +97,7 @@ struct find_free_extent_ctl;
    EM( FLUSH_DELALLOC_FULL,    "FLUSH_DELALLOC_FULL")      \
    EM( FLUSH_DELAYED_REFS_NR,  "FLUSH_DELAYED_REFS_NR")    \
    EM( FLUSH_DELAYED_REFS,     "FLUSH_DELAYED_REFS")       \
+   EM( RESET_ZONES,        "RESET_ZONES")          \
    EM( ALLOC_CHUNK,        "ALLOC_CHUNK")          \
    EM( ALLOC_CHUNK_FORCE,      "ALLOC_CHUNK_FORCE")        \
    EM( RUN_DELAYED_IPUTS,      "RUN_DELAYED_IPUTS")        \
On the zoned mode, a region that was once used and then freed is still
not reusable after the freeing. The underlying zone needs to be reset
before reusing. Btrfs resets a zone when it removes a block group, and
then a new block group is allocated on the zones to reuse the zones.
But, that is sometimes too slow to catch up with the write side.

This commit introduces a new space-info reclaim method ZONE_RESET. That
will pick a block group from the unused list and reset its zone to reuse
the zone_unusable space. It is faster than removing the block group and
re-creating a new block group on the same zones.

For the first implementation, the ZONE_RESET is only applied to a block
group whose region is fully zone_unusable. Reclaiming partially
zone_unusable block groups could be implemented later.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/space-info.c        |  28 +++++++++-
 fs/btrfs/space-info.h        |   5 ++
 fs/btrfs/zoned.c             | 104 +++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h             |   7 +++
 include/trace/events/btrfs.h |   1 +
 5 files changed, 143 insertions(+), 2 deletions(-)
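To illustrate the accounting with hypothetical numbers (not taken from the patch): consider a block group on a zone with zone_size = length = 256 MiB and zone_capacity = 236 MiB that was fully written and then completely freed. Before the reset, bg->zone_unusable == 256 MiB and bg->alloc_offset == 236 MiB, so the block group qualifies as fully zone_unusable. After RESET_ZONES picks it up, reclaimed = 236 MiB is added back to the block group's free_space, space_info->bytes_zone_unusable drops by 236 MiB, alloc_offset goes back to 0, and bg->zone_unusable is left at 256 MiB - 236 MiB = 20 MiB, the capacity gap that can never hold data.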