
[1/4] btrfs: ensure pages are unlocked on cow_file_range() failure

Message ID 318b80987f74e1cf6bf4ab09aed2399538fc4f9e.1655391633.git.naohiro.aota@wdc.com
State New, archived
Series btrfs: fix error handling of cow_file_range(unlock = 0)

Commit Message

Naohiro Aota June 16, 2022, 3:45 p.m. UTC
There is a hung_task report on zoned btrfs like the one below.

https://github.com/naota/linux/issues/59

[  726.328648] INFO: task rocksdb:high0:11085 blocked for more than 241 seconds.
[  726.329839]       Not tainted 5.16.0-rc1+ #1
[  726.330484] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  726.331603] task:rocksdb:high0   state:D stack:    0 pid:11085 ppid: 11082 flags:0x00000000
[  726.331608] Call Trace:
[  726.331611]  <TASK>
[  726.331614]  __schedule+0x2e5/0x9d0
[  726.331622]  schedule+0x58/0xd0
[  726.331626]  io_schedule+0x3f/0x70
[  726.331629]  __folio_lock+0x125/0x200
[  726.331634]  ? find_get_entries+0x1bc/0x240
[  726.331638]  ? filemap_invalidate_unlock_two+0x40/0x40
[  726.331642]  truncate_inode_pages_range+0x5b2/0x770
[  726.331649]  truncate_inode_pages_final+0x44/0x50
[  726.331653]  btrfs_evict_inode+0x67/0x480
[  726.331658]  evict+0xd0/0x180
[  726.331661]  iput+0x13f/0x200
[  726.331664]  do_unlinkat+0x1c0/0x2b0
[  726.331668]  __x64_sys_unlink+0x23/0x30
[  726.331670]  do_syscall_64+0x3b/0xc0
[  726.331674]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  726.331677] RIP: 0033:0x7fb9490a171b
[  726.331681] RSP: 002b:00007fb943ffac68 EFLAGS: 00000246 ORIG_RAX: 0000000000000057
[  726.331684] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb9490a171b
[  726.331686] RDX: 00007fb943ffb040 RSI: 000055a6bbe6ec20 RDI: 00007fb94400d300
[  726.331687] RBP: 00007fb943ffad00 R08: 0000000000000000 R09: 0000000000000000
[  726.331688] R10: 0000000000000031 R11: 0000000000000246 R12: 00007fb943ffb000
[  726.331690] R13: 00007fb943ffb040 R14: 0000000000000000 R15: 00007fb943ffd260
[  726.331693]  </TASK>

While debugging the issue, we found that running fstests generic/551 on a
5GB non-zoned null_blk device in emulated zoned mode also hit a similar
hang.

We can also reproduce the same symptom with an error-injected
cow_file_range() setup.

The hang occurs when cow_file_range() fails in the middle of
allocation. cow_file_range() called from do_allocation_zoned() can
split the given region ([start, end]) for allocation depending on
the current block group usage. When btrfs can allocate bytes for one
part of the split regions but fails for another (e.g. because of
-ENOSPC), we return the error, leaving the pages in the successfully
allocated regions locked. Technically, this occurs only when
@unlock == 0. Otherwise, we unlock the pages in an allocated region
after creating an ordered extent.
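
A minimal userspace model of the failure mode (illustrative only; the names
and the page-granularity bookkeeping are made up here, the @locked_page
special case is ignored, and this is not the kernel code):

/*
 * cow_file_range(unlock = 0) splits [start, end] into two sub-ranges and
 * the second allocation fails.  All pages start locked by the caller.
 * Before the fix, the error path never unlocks the pages of the first,
 * successfully allocated sub-range, so a later truncate blocks on them.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_PAGES 8

static bool locked[NR_PAGES];

static int model_cow_file_range(bool fixed)
{
	int split = NR_PAGES / 2;

	/* Sub-range 1 [0, split): allocation succeeds.  unlock == 0, so the
	 * pages are intentionally kept locked for a later writeout. */

	/* Sub-range 2 [split, NR_PAGES): allocation fails (think -ENOSPC). */
	if (fixed) {
		/* New behavior: unlock every page on error. */
		for (int i = 0; i < NR_PAGES; i++)
			locked[i] = false;
	} else {
		/* Old behavior: only the failing sub-range is cleaned up. */
		for (int i = split; i < NR_PAGES; i++)
			locked[i] = false;
	}
	return -1;
}

int main(void)
{
	for (int i = 0; i < NR_PAGES; i++)
		locked[i] = true;	/* delalloc pages locked by the caller */

	model_cow_file_range(false);
	for (int i = 0; i < NR_PAGES; i++)
		if (locked[i])
			printf("page %d left locked -> later truncate hangs\n", i);
	return 0;
}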

Since the callers of cow_file_range(unlock=0) will not write out the
pages, we can unlock the pages on the error exit from cow_file_range().
This ensures all the pages except @locked_page are unlocked in the
error case.

In summary, cow_file_range() now behaves like this:

- page_started == 1 (return value)
  - All the pages are unlocked. IO is started.
- unlock == 1
  - All the pages except @locked_page are unlocked in any case
- unlock == 0
  - On success, all the pages are locked for writing them out
  - On failure, all the pages except @locked_page are unlocked

Fixes: 42c011000963 ("btrfs: zoned: introduce dedicated data write path for zoned filesystems")
CC: stable@vger.kernel.org # 5.12+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/inode.c | 72 ++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 64 insertions(+), 8 deletions(-)
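
For reference, a small userspace helper (illustrative only, not kernel code)
prints the three cleanup regions that the patch below distinguishes on the
error path, following the diagram added in its code comment:

#include <stdio.h>

static void show_cleanup_regions(unsigned long long orig_start,
				 unsigned long long start,
				 unsigned long long cur_alloc_size,
				 unsigned long long end)
{
	/* (1) [orig_start, start - 1]: ordered extents already created; with
	 *     unlock == 0 the pages still have to be unlocked on error. */
	if (orig_start < start)
		printf("(1) unlock pages only:              [%llu, %llu]\n",
		       orig_start, start - 1);

	/* (2) [start, start + cur_alloc_size - 1]: extent reserved but no
	 *     ordered extent yet; cleared without EXTENT_CLEAR_DATA_RESV so
	 *     the data space_info's bytes_may_use is not decremented twice. */
	if (cur_alloc_size) {
		printf("(2) clear, keep data reservation:   [%llu, %llu]\n",
		       start, start + cur_alloc_size - 1);
		start += cur_alloc_size;
	}

	/* (3) [start, end]: never touched; cleared together with
	 *     EXTENT_CLEAR_DATA_RESV to release bytes_may_use. */
	if (start <= end)
		printf("(3) clear + EXTENT_CLEAR_DATA_RESV: [%llu, %llu]\n",
		       start, end);
}

int main(void)
{
	/* Example: 1 MiB delalloc range; the first 256 KiB already has an
	 * ordered extent and the next 256 KiB was reserved when the failure
	 * happened. */
	show_cleanup_regions(0, 256 * 1024, 256 * 1024, 1024 * 1024 - 1);
	return 0;
}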

Comments

Filipe Manana June 17, 2022, 10:11 a.m. UTC | #1
On Fri, Jun 17, 2022 at 12:45:39AM +0900, Naohiro Aota wrote:
[...]
> @@ -1359,6 +1408,13 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>  		if (start >= end)
>  			goto out;
>  	}
> +
> +	/*
> +	 * For the range (3). We never touched the region. In addition to the
> +	 * clear_bits above, we add EXTENT_CLEAR_DATA_RESV to release the data
> +	 * space_info's bytes_may_use counter, reserved in e.g,
> +	 * btrfs_check_data_free_space().

This says "in e.g.", but in fact there's only one place where we increment the
bytes_may_use counter for data, which is at btrfs_check_data_free_space(). So
just remove the "in e.g." part, because it suggests that the increment/reservation
can happen in more than one place.

Other than that, it looks good, thanks.

> +	 */
>  	extent_clear_unlock_delalloc(inode, start, end, locked_page,
>  				     clear_bits | EXTENT_CLEAR_DATA_RESV,
>  				     page_ops);
Naohiro Aota June 20, 2022, 2:22 a.m. UTC | #2
On Fri, Jun 17, 2022 at 11:11:49AM +0100, Filipe Manana wrote:
> On Fri, Jun 17, 2022 at 12:45:39AM +0900, Naohiro Aota wrote:
[...]
> > @@ -1359,6 +1408,13 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
> >  		if (start >= end)
> >  			goto out;
> >  	}
> > +
> > +	/*
> > +	 * For the range (3). We never touched the region. In addition to the
> > +	 * clear_bits above, we add EXTENT_CLEAR_DATA_RESV to release the data
> > +	 * space_info's bytes_may_use counter, reserved in e.g,
> > +	 * btrfs_check_data_free_space().
> 
> This says "in e.g.", but in fact there's only one place where we increment the
> bytes_may_use counter for data, which is at btrfs_check_data_free_space(). So
> just remove the "in e.g." part, because it suggests that the increment/reservation
> can happen in more than one place.

We also have btrfs_alloc_data_chunk_ondemand() in btrfs_fallocate() or
btrfs_do_encoded_write(). But, yeah, for the normal write path calling
cow_file_range(), btrfs_check_data_free_space() is the only place. I'll
update the comment.

Thanks.

> Other than that, it looks good, thanks.
> 
> > +	 */
> >  	extent_clear_unlock_delalloc(inode, start, end, locked_page,
> >  				     clear_bits | EXTENT_CLEAR_DATA_RESV,
> >  				     page_ops);

Patch

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1247690e7021..0c3d9998470f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1134,6 +1134,28 @@  static u64 get_extent_allocation_hint(struct btrfs_inode *inode, u64 start,
  * *page_started is set to one if we unlock locked_page and do everything
  * required to start IO on it.  It may be clean and already done with
  * IO when we return.
+ *
+ * When unlock == 1, we unlock the pages in successfully allocated regions.
+ * When unlock == 0, we leave them locked for writing them out.
+ *
+ * However, we unlock all the pages except @locked_page in case of failure.
+ *
+ * In summary, page locking state will be as follows:
+ *
+ * - page_started == 1 (return value)
+ *     - All the pages are unlocked. IO is started.
+ *     - Note that this can happen only on success
+ * - unlock == 1
+ *     - All the pages except @locked_page are unlocked in any case
+ * - unlock == 0
+ *     - On success, all the pages are locked for writing them out
+ *     - On failure, all the pages except @locked_page are unlocked
+ *
+ * When a failure happens in the second or later iteration of the
+ * while-loop, the ordered extents created in previous iterations are kept
+ * intact. So, the caller must clean them up by calling
+ * btrfs_cleanup_ordered_extents(). See btrfs_run_delalloc_range() for
+ * example.
  */
 static noinline int cow_file_range(struct btrfs_inode *inode,
 				   struct page *locked_page,
@@ -1143,6 +1165,7 @@  static noinline int cow_file_range(struct btrfs_inode *inode,
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	u64 alloc_hint = 0;
+	u64 orig_start = start;
 	u64 num_bytes;
 	unsigned long ram_size;
 	u64 cur_alloc_size = 0;
@@ -1336,18 +1359,44 @@  static noinline int cow_file_range(struct btrfs_inode *inode,
 	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
 	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
 out_unlock:
+	/*
+	 * Now, we have three regions to clean up, as shown below.
+	 *
+	 * |-------(1)----|---(2)---|-------------(3)----------|
+	 * `- orig_start  `- start  `- start + cur_alloc_size  `- end
+	 *
+	 * We process each region below.
+	 */
+
 	clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
 		EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV;
 	page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
+
 	/*
-	 * If we reserved an extent for our delalloc range (or a subrange) and
-	 * failed to create the respective ordered extent, then it means that
-	 * when we reserved the extent we decremented the extent's size from
-	 * the data space_info's bytes_may_use counter and incremented the
-	 * space_info's bytes_reserved counter by the same amount. We must make
-	 * sure extent_clear_unlock_delalloc() does not try to decrement again
-	 * the data space_info's bytes_may_use counter, therefore we do not pass
-	 * it the flag EXTENT_CLEAR_DATA_RESV.
+	 * For the range (1). We have already instantiated the ordered extents
+	 * for this region. They are cleaned up by
+	 * btrfs_cleanup_ordered_extents() in e.g,
+	 * btrfs_run_delalloc_range(). EXTENT_LOCKED | EXTENT_DELALLOC are
+	 * already cleared in the above loop. And, EXTENT_DELALLOC_NEW |
+	 * EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV are handled by the cleanup
+	 * function.
+	 *
+	 * However, in case of unlock == 0, we still need to unlock the pages
+	 * (except @locked_page) to ensure all the pages are unlocked.
+	 */
+	if (!unlock && orig_start < start)
+		extent_clear_unlock_delalloc(inode, orig_start, start - 1,
+					     locked_page, 0, page_ops);
+
+	/*
+	 * For the range (2). If we reserved an extent for our delalloc range
+	 * (or a subrange) and failed to create the respective ordered extent,
+	 * then it means that when we reserved the extent we decremented the
+	 * extent's size from the data space_info's bytes_may_use counter and
+	 * incremented the space_info's bytes_reserved counter by the same
+	 * amount. We must make sure extent_clear_unlock_delalloc() does not try
+	 * to decrement again the data space_info's bytes_may_use counter,
+	 * therefore we do not pass it the flag EXTENT_CLEAR_DATA_RESV.
 	 */
 	if (extent_reserved) {
 		extent_clear_unlock_delalloc(inode, start,
@@ -1359,6 +1408,13 @@  static noinline int cow_file_range(struct btrfs_inode *inode,
 		if (start >= end)
 			goto out;
 	}
+
+	/*
+	 * For the range (3). We never touched the region. In addition to the
+	 * clear_bits above, we add EXTENT_CLEAR_DATA_RESV to release the data
+	 * space_info's bytes_may_use counter, reserved in e.g,
+	 * btrfs_check_data_free_space().
+	 */
 	extent_clear_unlock_delalloc(inode, start, end, locked_page,
 				     clear_bits | EXTENT_CLEAR_DATA_RESV,
 				     page_ops);