
[00/13] btrfs: zoned: fix active zone tracking issues

Message ID cover.1656909695.git.naohiro.aota@wdc.com (mailing list archive)


Naohiro Aota July 4, 2022, 4:58 a.m. UTC
This series mainly addresses two issues in zoned btrfs's active zone
tracking, plus one issue that the main fixes depend on.

* Background

A ZNS drive has an upper limit on the number of zones that can be written
simultaneously. We call this limit max_active_zones. An active zone is
deactivated when we write the zone fully, or when we explicitly send a
REQ_OP_ZONE_FINISH command to make it full.

Zoned btrfs must be aware of max_active_zones to use a ZNS drive. So, we
have an active zone tracking system that considers a block group active iff
the underlying zone is active. In practice, we consider a block group (and
its underlying zones) active when we start allocating from it. Then, when
the last allocatable region in the block group is written, we send a
REQ_OP_ZONE_FINISH command to each zone and consider the block group
inactive.

So, in short, we currently depend on fully writing a zone to finish a block group.
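As a rough sketch of the accounting described above (hypothetical structures and names, not the actual kernel code in fs/btrfs/zoned.c), the active zone bookkeeping could be modeled like this:

```c
#include <stdbool.h>

/*
 * Toy model of active zone tracking: a zone counts as active from the
 * moment allocation starts until it is written full or explicitly
 * finished (the REQ_OP_ZONE_FINISH case).
 */
struct zone_tracker {
	int max_active_zones;	/* device limit */
	int nr_active;		/* zones currently open for writing */
};

/* Try to activate one more zone; fails once the device limit is hit. */
static bool zone_activate(struct zone_tracker *zt)
{
	if (zt->nr_active >= zt->max_active_zones)
		return false;
	zt->nr_active++;
	return true;
}

/* Model of REQ_OP_ZONE_FINISH: the zone no longer counts as active. */
static void zone_finish(struct zone_tracker *zt)
{
	if (zt->nr_active > 0)
		zt->nr_active--;
}
```

The key property, which both issues below revolve around, is that activation can fail while finishing always frees up a slot.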

* Issues
** Issue A

In certain situations, the current zoned btrfs extent allocation fails with
an early -ENOSPC on a ZNS drive. When no block group has enough space left
for the allocation, btrfs tries to allocate a new block group, but only if
it can activate a new zone. If it cannot, it returns -ENOSPC even though
the device still has free space left.

** Issue B

When doing a buffered write, we call cow_file_range() to allocate the data
extent. cow_file_range() works in an all-or-nothing manner: it returns 0 if
it can allocate the whole range, or -ENOSPC if not. Thus, when every block
group has only a small amount of free space left and btrfs cannot finish
any block group, the allocation partly succeeds but fails in the end. This
also results in an early -ENOSPC.

There are situations in which we cannot finish any block group. Consider
that we have 8 active data block groups (ignore metadata/system block
groups here), each with 1 MB of free space left, and we want to do a 10 MB
buffered write. We can allocate blocks for 8 of the 10 MB, and then we can
no longer allocate from any block group. Furthermore, we cannot finish any
block group, because every block group now has 1 MB of reserved, unwritten
space left. And since these 1 MB regions are owned by the allocating
process itself, simply waiting for the regions to be written won't work.
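The arithmetic of that example can be reproduced with a small simulation (illustrative structures only): reserving space moves bytes from free to reserved, and it is exactly that reserved, unwritten space that blocks finishing any zone.

```c
#include <stdbool.h>

#define MIB (1024ULL * 1024ULL)

/* Toy block group: free space left vs. space reserved but not written. */
struct bg {
	unsigned long long free;
	unsigned long long reserved;
};

/*
 * Greedily reserve @want bytes across the block groups; returns how much
 * could actually be reserved.
 */
static unsigned long long reserve_bytes(struct bg *bgs, int nr,
					unsigned long long want)
{
	unsigned long long got = 0;

	for (int i = 0; i < nr && got < want; i++) {
		unsigned long long take = bgs[i].free;

		if (take > want - got)
			take = want - got;
		bgs[i].free -= take;
		bgs[i].reserved += take;
		got += take;
	}
	return got;
}

/* In this model, a block group with reserved unwritten space cannot be
 * finished; we can only finish one that has no reservation pending. */
static bool can_finish_any(const struct bg *bgs, int nr)
{
	for (int i = 0; i < nr; i++)
		if (bgs[i].reserved == 0)
			return true;
	return false;
}
```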

** Issue C

To address issue A, we needed to disable metadata reservation over-commit.
That revealed that we underestimate the number of extents to be written on
zoned btrfs. On zoned btrfs, we use the ZONE APPEND command to write data,
whose bio size is limited by max_zone_append_sectors and max_segments. So,
a data extent is always split at the size of that limit at most. As a
result, if BTRFS_MAX_EXTENT_SIZE is larger than the limit, we tend to have
more extents than the estimation using BTRFS_MAX_EXTENT_SIZE expects.

The metadata reservation is done before the allocation (e.g., at
btrfs_buffered_write) and released afterward along with the delalloc
process or the ordered extent creation. As a result, we can run short of
the metadata reservation in certain situations, which can trigger a WARN.
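The underestimation is just a round-up division with the wrong divisor. As a hedged sketch (the 256 KiB zone-append limit below is an assumed example value, not a real device parameter):

```c
/* Regular btrfs assumes one extent can cover up to 128 MiB. */
#define BTRFS_MAX_EXTENT_SIZE (128ULL * 1024 * 1024)

/* count_max_extents()-style estimate: how many extents a delalloc range
 * of @size bytes becomes, given a per-extent size limit. */
static unsigned long long count_extents(unsigned long long size,
					unsigned long long max_extent_size)
{
	return (size + max_extent_size - 1) / max_extent_size;
}
```

With a zone-append bio limit smaller than BTRFS_MAX_EXTENT_SIZE, the real extent count exceeds the estimate, and with it the metadata reserved per extent falls short.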

* Solutions
** For issue A

Issue A is that we can have early -ENOSPC if we cannot activate another
block group and no block group has enough space left.

To avoid the early -ENOSPC, we need to choose one block group and finish it
to make room for a new block group to be activated. But that is only
possible from the data extent allocation context. From the metadata
context, it can cause a deadlock, because we might need to wait for a
running transaction to make the finishing block group read-only.

So, we use two different methods for data allocation and metadata
allocation. For data allocation, we can finish a block group on demand from
the btrfs_reserve_extent() context. The block group to finish is the one
with the least free space left.
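The "least free space" selection is a simple linear scan; sketched with a hypothetical structure (not the kernel's struct btrfs_block_group):

```c
/* Toy data block group carrying only the field the heuristic needs. */
struct data_bg {
	unsigned long long free;
};

/*
 * Pick the block group to finish: the one with the least free space
 * left, so finishing it wastes the least allocatable capacity.
 * Returns the index, or -1 if there are no block groups.
 */
static int pick_bg_to_finish(const struct data_bg *bgs, int nr)
{
	int best = -1;

	for (int i = 0; i < nr; i++)
		if (best < 0 || bgs[i].free < bgs[best].free)
			best = i;
	return best;
}
```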

For metadata allocation, we use flush_space() to ensure that reserved bytes
can be written into active block groups. To do so, we track active block
groups' total bytes as active_total_bytes, and activate a block group
on-demand from flush_space().
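The flush_space() side can be sketched as a fit-or-activate check against the tracked total (all names here are illustrative, assuming over-commit is disabled as the series does):

```c
#include <stdbool.h>

/* Toy metadata space info: with over-commit disabled, a reservation must
 * fit inside the total bytes of *active* block groups. */
struct meta_space_info {
	unsigned long long active_total_bytes;	/* sum of active BG sizes */
	unsigned long long bytes_used;		/* consumed + reserved */
};

static bool can_reserve(const struct meta_space_info *si,
			unsigned long long bytes)
{
	return si->bytes_used + bytes <= si->active_total_bytes;
}

/*
 * flush_space()-style path: if the reservation does not fit into active
 * block groups, activate one more block group of @bg_size on demand
 * (when the device's active zone limit still allows it).
 */
static bool reserve_or_activate(struct meta_space_info *si,
				unsigned long long bytes,
				unsigned long long bg_size,
				bool can_activate)
{
	if (!can_reserve(si, bytes)) {
		if (!can_activate)
			return false;
		si->active_total_bytes += bg_size;
	}
	if (!can_reserve(si, bytes))
		return false;
	si->bytes_used += bytes;
	return true;
}
```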

Also, a block group newly allocated from certain contexts must be activated
immediately.

** For issue B

Issue B is the case where we can neither allocate space from any block
group nor finish any block group. This issue only occurs when allocating a
data extent, because the metadata reservation is guaranteed to be contained
in active block groups by the solution for issue A.

In this case, writing out the partially allocated region closes the gap
between the allocation pointer and the capacity of the block group,
finishes the zone, and opens up room to activate a new block group. So,
this series implements writing out the partial allocation and retrying the
rest of the allocation.

In certain cases, we can't allocate anything from the block groups. In that
case, we expect there are on-going IOs that will finish a block group, so
we wait for them and retry the allocation.
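The allocate/write-out/retry loop for issue B can be modeled in a few lines. This is a deliberately simplified simulation: "avail_later" stands in for the space that only becomes reachable after the partial write-out finishes a zone and a new block group gets activated.

```c
/* Toy allocator state for the partial write-out retry loop. */
struct alloc_sim {
	unsigned long long avail_now;	/* allocatable right now */
	unsigned long long avail_later;	/* opens up after write-out */
};

/*
 * Allocate as much as possible, write the partial allocation out (which
 * finishes the zone and lets a new block group be activated), then retry
 * the remainder. Returns the total bytes successfully written.
 */
static unsigned long long write_with_retry(struct alloc_sim *s,
					   unsigned long long want)
{
	unsigned long long done = 0;

	while (done < want) {
		unsigned long long got = s->avail_now;

		if (got > want - done)
			got = want - done;
		if (got == 0)
			break;	/* nothing allocatable: would wait on IO */
		done += got;
		s->avail_now -= got;
		/* the write-out frees an active zone slot, exposing the
		 * space of a freshly activated block group */
		s->avail_now += s->avail_later;
		s->avail_later = 0;
	}
	return done;
}
```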

** For issue C

Issue C is that we underestimate the number of extents to be written on
zoned btrfs, because we don't expect an ordered extent to be split at the
size of a bio.

We need to use a proper extent size limit to fix issue C. For that, we
revive fs_info->max_zone_append_bytes and use it to calculate
count_max_extents(). Technically, the bio size is also limited by
max_segments, so the limit is capped by that as well.
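Deriving the effective per-extent limit is then just the minimum of the two bio caps. A sketch under the assumption of a 4 KiB page (the helper name is illustrative):

```c
#define SECTOR_SHIFT	9
#define PAGE_SIZE_SIM	4096ULL	/* assumed 4 KiB pages */

/*
 * A ZONE APPEND bio is capped both by max_zone_append_sectors and by
 * max_segments (each segment holding at most one page here), so the
 * effective extent size limit is the smaller of the two in bytes.
 */
static unsigned long long effective_max_extent_size(
		unsigned long long max_zone_append_sectors,
		unsigned long long max_segments)
{
	unsigned long long append_bytes =
		max_zone_append_sectors << SECTOR_SHIFT;
	unsigned long long seg_bytes = max_segments * PAGE_SIZE_SIM;

	return append_bytes < seg_bytes ? append_bytes : seg_bytes;
}
```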

* Patch structure
 
The fix for issue C comes first because it is a dependency of the fixes for
issue A and B.

Patches 1 to 5 address issue C by reviving fs_info->max_zone_append_bytes
and using it to replace BTRFS_MAX_EXTENT_SIZE on zoned btrfs.

Patches 6 to 11 address issue A. In detail, patch 7 fixes the data
allocation by finishing a block group when we cannot activate another block
group. Patch 10 fixes the metadata allocation by finishing a block group at
space reservation time.

Patches 12 and 13 address issue B by writing out the successfully allocated
part first and retrying the allocation of the rest.

Naohiro Aota (13):
  block: add bdev_max_segments() helper
  btrfs: zoned: revive max_zone_append_bytes
  btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size
  btrfs: convert count_max_extents() to use fs_info->max_extent_size
  btrfs: use fs_info->max_extent_size in get_extent_max_capacity()
  btrfs: let can_allocate_chunk return int
  btrfs: zoned: finish least available block group on data BG allocation
  btrfs: zoned: introduce space_info->active_total_bytes
  btrfs: zoned: disable metadata overcommit for zoned
  btrfs: zoned: activate metadata BG on flush_space
  btrfs: zoned: activate necessary block group
  btrfs: zoned: write out partially allocated region
  btrfs: zoned: wait until zone is finished when allocation didn't
    progress

 fs/btrfs/block-group.c    |  23 +++++++-
 fs/btrfs/ctree.h          |  25 +++++---
 fs/btrfs/delalloc-space.c |   6 +-
 fs/btrfs/disk-io.c        |   3 +
 fs/btrfs/extent-tree.c    |  64 +++++++++++++++-----
 fs/btrfs/extent_io.c      |   3 +-
 fs/btrfs/inode.c          |  90 ++++++++++++++++++++--------
 fs/btrfs/ioctl.c          |  11 ++--
 fs/btrfs/space-info.c     |  66 +++++++++++++++++----
 fs/btrfs/space-info.h     |   4 +-
 fs/btrfs/zoned.c          | 119 ++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h          |  18 ++++++
 include/linux/blkdev.h    |   5 ++
 13 files changed, 368 insertions(+), 69 deletions(-)

Comments

David Sterba July 8, 2022, 6:01 p.m. UTC | #1
On Mon, Jul 04, 2022 at 01:58:04PM +0900, Naohiro Aota wrote:
> This series addresses mainly two issues on zoned btrfs' active zone
> tracking and one issue which is a dependency of the main issue.
> 
> * Background
[...]

Thanks for the writeup, this seems to be fixing a serious problem, also
guessing by the length of the series. Some of the patches are marked for
stable 5.12 or 5.16 but I think this would need to be backported
manually and to 5.18 as the other versions have been EOLed.

As most of the changes are in zoned code I can add the whole series to
misc-next rather sooner than later because the code freeze is near.

I did a quick test and it crashes in the self tests so I can't add the
branch to for-next.

[   13.324894] Btrfs loaded, crc32c=crc32c-generic, debug=on, assert=on, integrity-checker=on, ref-verify=on, zoned=yes, fsverity=yes
[   13.326507] BTRFS: selftest: sectorsize: 4096  nodesize: 4096
[   13.327303] BTRFS: selftest: running btrfs free space cache tests
[   13.328133] BTRFS: selftest: running extent only tests
[   13.328935] BTRFS: selftest: running bitmap only tests
[   13.329770] BTRFS: selftest: running bitmap and extent tests
[   13.330647] BTRFS: selftest: running space stealing from bitmap to extent tests
[   13.331990] BTRFS: selftest: running bytes index tests
[   13.332915] BTRFS: selftest: running extent buffer operation tests
[   13.333924] BTRFS: selftest: running btrfs_split_item tests
[   13.334922] BTRFS: selftest: running extent I/O tests
[   13.335733] BTRFS: selftest: running find delalloc tests
[   13.525595] BUG: unable to handle page fault for address: 0000000000002360
[   13.526677] #PF: supervisor read access in kernel mode
[   13.527480] #PF: error_code(0x0000) - not-present page
[   13.528381] PGD 0 P4D 0 
[   13.528909] Oops: 0000 [#1] PREEMPT SMP
[   13.529604] CPU: 0 PID: 642 Comm: modprobe Not tainted 5.19.0-rc5-default+ #1809
[   13.530742] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
[   13.532475] RIP: 0010:find_lock_delalloc_range+0x41/0x2a0 [btrfs]
[   13.535137] RSP: 0018:ffffb750c05ebcd8 EFLAGS: 00010296
[   13.535467] RAX: 0000000000000000 RBX: ffff96b1fe0f3440 RCX: 0000000000000fff
[   13.535880] RDX: ffffb750c05ebd60 RSI: ffff96b1fe0f3440 RDI: ffff96b1838b8f00
[   13.536534] RBP: ffff96b1838b8f00 R08: 0000000000000000 R09: 0000000000000001
[   13.537141] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000001000
[   13.537548] R13: ffff96b1838b8ab8 R14: 0000000000000fff R15: ffffb750c05ebd58
[   13.537956] FS:  00007eff49b43740(0000) GS:ffff96b1fd600000(0000) knlGS:0000000000000000
[   13.538467] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   13.538808] CR2: 0000000000002360 CR3: 0000000003f1c000 CR4: 00000000000006b0
[   13.539213] Call Trace:
[   13.539407]  <TASK>
[   13.539584]  test_find_delalloc+0x19b/0x695 [btrfs]
[   13.539978]  btrfs_test_extent_io+0x1e/0x39 [btrfs]
[   13.540710]  btrfs_run_sanity_tests.cold+0x33/0xcd [btrfs]
[   13.541686]  init_btrfs_fs+0xcc/0x12b [btrfs]
[   13.542328]  ? 0xffffffffc060a000
[   13.542868]  do_one_initcall+0x65/0x330
[   13.543348]  ? rcu_read_lock_sched_held+0x3b/0x70
[   13.543980]  ? trace_kmalloc+0x33/0xe0
[   13.544394]  ? kmem_cache_alloc_trace+0x188/0x270
[   13.544805]  do_init_module+0x4a/0x1f0
[   13.545076]  __do_sys_finit_module+0x9e/0xf0
[   13.545373]  do_syscall_64+0x3c/0x80
[   13.545618]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[   13.545961] RIP: 0033:0x7eff49c6da8d
[   13.547264] RSP: 002b:00007ffcb6e6bbc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[   13.547724] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007eff49c6da8d
[   13.548184] RDX: 0000000000000000 RSI: 000055a8143faab2 RDI: 000000000000000d
[   13.549012] RBP: 000055a81441d460 R08: 0000000000000000 R09: 0000000000000000
[   13.549550] R10: 000000000000000d R11: 0000000000000246 R12: 000055a8143faab2
[   13.549958] R13: 000055a814423530 R14: 0000000000000000 R15: 000055a814424ab8
[   13.550363]  </TASK>
[   13.550541] Modules linked in: btrfs(+) blake2b_generic libcrc32c xor lzo_compress lzo_decompress raid6_pq zstd_decompress zstd_compress xxhash loop
[   13.551275] CR2: 0000000000002360
[   13.551525] ---[ end trace 0000000000000000 ]---
[   13.551819] RIP: 0010:find_lock_delalloc_range+0x41/0x2a0 [btrfs]
[   13.555923] RSP: 0018:ffffb750c05ebcd8 EFLAGS: 00010296
[   13.557052] RAX: 0000000000000000 RBX: ffff96b1fe0f3440 RCX: 0000000000000fff
[   13.558614] RDX: ffffb750c05ebd60 RSI: ffff96b1fe0f3440 RDI: ffff96b1838b8f00
[   13.559857] RBP: ffff96b1838b8f00 R08: 0000000000000000 R09: 0000000000000001
[   13.561280] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000001000
[   13.562619] R13: ffff96b1838b8ab8 R14: 0000000000000fff R15: ffffb750c05ebd58
[   13.563522] FS:  00007eff49b43740(0000) GS:ffff96b1fd600000(0000) knlGS:0000000000000000
[   13.564612] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   13.565356] CR2: 0000000000002360 CR3: 0000000003f1c000 CR4: 00000000000006b0
Naohiro Aota July 8, 2022, 11:06 p.m. UTC | #2
On Fri, Jul 08, 2022 at 08:01:01PM +0200, David Sterba wrote:
> On Mon, Jul 04, 2022 at 01:58:04PM +0900, Naohiro Aota wrote:
> > This series addresses mainly two issues on zoned btrfs' active zone
> > tracking and one issue which is a dependency of the main issue.
> > 
> > * Background
> [...]
> 
> Thanks for the writeup, this seems to be fixing a serious problem, also
> guessing by the length of the series. Some of the patches are marked for
> stable 5.12 or 5.16 but I think this would need to be backported
> manually and to 5.18 as the other versions have been EOLed.
> 
> As most of the changes are in zoned code I can add the whole series to
> misc-next rather sooner than later because the code freeze is near.

Thank you.

> I did a quick test and it crashes in the self tests so I can't add the
> branch to for-next.
> 
> [...]

Yes. This is also pointed out by Johannes. I'm going to send v2 with this fixed.