
btrfs: ensure pages are unlocked on cow_file_range() failure

Message ID 20211213034338.949507-1-naohiro.aota@wdc.com (mailing list archive)
State New, archived
Series btrfs: ensure pages are unlocked on cow_file_range() failure

Commit Message

Naohiro Aota Dec. 13, 2021, 3:43 a.m. UTC
There is a hung_task report regarding the page lock on zoned btrfs, shown below.

https://github.com/naota/linux/issues/59

[  726.328648] INFO: task rocksdb:high0:11085 blocked for more than 241 seconds.
[  726.329839]       Not tainted 5.16.0-rc1+ #1
[  726.330484] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  726.331603] task:rocksdb:high0   state:D stack:    0 pid:11085 ppid: 11082 flags:0x00000000
[  726.331608] Call Trace:
[  726.331611]  <TASK>
[  726.331614]  __schedule+0x2e5/0x9d0
[  726.331622]  schedule+0x58/0xd0
[  726.331626]  io_schedule+0x3f/0x70
[  726.331629]  __folio_lock+0x125/0x200
[  726.331634]  ? find_get_entries+0x1bc/0x240
[  726.331638]  ? filemap_invalidate_unlock_two+0x40/0x40
[  726.331642]  truncate_inode_pages_range+0x5b2/0x770
[  726.331649]  truncate_inode_pages_final+0x44/0x50
[  726.331653]  btrfs_evict_inode+0x67/0x480
[  726.331658]  evict+0xd0/0x180
[  726.331661]  iput+0x13f/0x200
[  726.331664]  do_unlinkat+0x1c0/0x2b0
[  726.331668]  __x64_sys_unlink+0x23/0x30
[  726.331670]  do_syscall_64+0x3b/0xc0
[  726.331674]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  726.331677] RIP: 0033:0x7fb9490a171b
[  726.331681] RSP: 002b:00007fb943ffac68 EFLAGS: 00000246 ORIG_RAX: 0000000000000057
[  726.331684] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb9490a171b
[  726.331686] RDX: 00007fb943ffb040 RSI: 000055a6bbe6ec20 RDI: 00007fb94400d300
[  726.331687] RBP: 00007fb943ffad00 R08: 0000000000000000 R09: 0000000000000000
[  726.331688] R10: 0000000000000031 R11: 0000000000000246 R12: 00007fb943ffb000
[  726.331690] R13: 00007fb943ffb040 R14: 0000000000000000 R15: 00007fb943ffd260
[  726.331693]  </TASK>

While debugging the issue, we found that running fstests generic/551 on a
5GB non-zoned null_blk device in the emulated zoned mode also hit a
similar hang.
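For reference, a minimal reproduction environment can be sketched roughly like this (a setup sketch only, assuming root, a kernel with the null_blk module, and btrfs-progs with zoned feature support; the device name and sizes are illustrative):

```shell
# Create a 5GB non-zoned, memory-backed null_blk device.
modprobe null_blk nr_devices=1 gb=5 memory_backed=1

# Format with btrfs zone emulation: -O zoned on a conventional device
# makes btrfs emulate fixed-size zones. Then mount it.
mkfs.btrfs -O zoned /dev/nullb0
mount /dev/nullb0 /mnt

# Then run the fstests case that reproduced the hang:
# cd fstests && ./check generic/551
```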

The hang occurs when cow_file_range() fails in the middle of the
allocation. cow_file_range() called from do_allocation_zoned() can
split the given region ([start, end]) for allocation depending on
current block group usage. When btrfs can allocate bytes for one part
of the split regions but fails for the other region (e.g. because of
-ENOSPC), we return the error leaving the pages in the successfully
allocated regions locked. Technically, this occurs only when @unlock == 0.
Otherwise, we unlock the pages in an allocated region after creating an
ordered extent.

Theoretically, the same issue can happen in
submit_uncompressed_range(). However, I could not make it happen even
after modifying the code to always go through
submit_uncompressed_range().

Considering that the callers of cow_file_range(unlock=0) won't write out
the pages, we can unlock the pages on error exit from
cow_file_range(). That way, we ensure all the pages except @locked_page
are unlocked in the error case.

In summary, cow_file_range now behaves like this:

- page_started == 1 (return value)
  - All the pages are unlocked. IO is started.
- unlock == 0
  - All the pages except @locked_page are unlocked in any case
- unlock == 1
  - On success, all the pages are locked for writing out them
  - On failure, all the pages except @locked_page are unlocked

Fixes: 42c011000963 ("btrfs: zoned: introduce dedicated data write path for zoned filesystems")
CC: stable@vger.kernel.org # 5.12+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
Theoretically, this can also fix a potential issue in
submit_uncompressed_range(). However, I set the stable target only
for the zoned code, because I cannot make compressed writes fail on
regular btrfs.
---
 fs/btrfs/inode.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

Comments

Filipe Manana Dec. 13, 2021, 10:14 a.m. UTC | #1
On Mon, Dec 13, 2021 at 12:43:38PM +0900, Naohiro Aota wrote:
> [...]
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index b8c911a4a320..e21c94bbc56c 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1109,6 +1109,22 @@ static u64 get_extent_allocation_hint(struct btrfs_inode *inode, u64 start,
>   * *page_started is set to one if we unlock locked_page and do everything
>   * required to start IO on it.  It may be clean and already done with
>   * IO when we return.
> + *
> + * When unlock == 1, we unlock the pages in successfully allocated regions. We
> + * leave them locked otherwise for writing them out.
> + *
> + * However, we unlock all the pages except @locked_page in case of failure.
> + *
> + * In summary, page locking state will be as follow:
> + *
> + * - page_started == 1 (return value)
> + *     - All the pages are unlocked. IO is started.
> + *     - Note that this can happen only on success
> + * - unlock == 0
> + *     - All the pages except @locked_page are unlocked in any case

That should be the unlock == 1 case.

> + * - unlock == 1
> + *     - On success, all the pages are locked for writing out them
> + *     - On failure, all the pages except @locked_page are unlocked

And that should be the unlock == 0 case.

Otherwise it looks fine.

Thanks.

>   */
>  static noinline int cow_file_range(struct btrfs_inode *inode,
>  				   struct page *locked_page,
> @@ -1118,6 +1134,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>  	struct btrfs_root *root = inode->root;
>  	struct btrfs_fs_info *fs_info = root->fs_info;
>  	u64 alloc_hint = 0;
> +	u64 orig_start = start;
>  	u64 num_bytes;
>  	unsigned long ram_size;
>  	u64 cur_alloc_size = 0;
> @@ -1305,6 +1322,10 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>  	clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
>  		EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV;
>  	page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
> +	/* Ensure we unlock all the pages except @locked_page in the error case */
> +	if (!unlock && orig_start < start)
> +		extent_clear_unlock_delalloc(inode, orig_start, start - 1,
> +					     locked_page, clear_bits, page_ops);
>  	/*
>  	 * If we reserved an extent for our delalloc range (or a subrange) and
>  	 * failed to create the respective ordered extent, then it means that
> -- 
> 2.34.1
>
Josef Bacik Dec. 13, 2021, 4:08 p.m. UTC | #2
On Mon, Dec 13, 2021 at 12:43:38PM +0900, Naohiro Aota wrote:
> [...]

I spent a good deal of time going through this code, and it needs some cleaning
up that can be done separately from this patch.  However this patch isn't quite
right either.

For the unlock == 1 case we'll just leave the ordered extent sitting around,
never submitting it for IO.  So eventually we'll come around and do
btrfs_wait_ordered_extent() and hang.

We need the extent_clear_unlock_delalloc() with START_WRITEBACK and
END_WRITEBACK so that the ordered extent cleanup for the one we created gets
run; we just need to not do the PAGE_UNLOCK for that range.

Also you are doing CLEAR_META_RESV here, which isn't correct because that'll be
handled at ordered extent cleanup time.  I'm sort of surprised you didn't get
leak errors with this fix.

There's like 3 different error conditions we want to handle here, all with
subtly different error handling.

1. We didn't do anything, we can just clean everything up.  We can just do

        clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |  
                EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV;
        page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;   
        extent_clear_unlock_delalloc(inode, start, end, locked_page,
                                     clear_bits | EXTENT_CLEAR_DATA_RESV,                                  
                                     page_ops);

  This is because we need to clear the META and DATA reservations, and we need
  to unlock all pages (except lock_page) and be done.

2. start == orig_start, but we couldn't create our ordered_extent.  The above
   needs to be called on [start, start+ins.offset-1] without DATA_RESV because
   that was handled by the btrfs_reserve_extent().  Then we need to do the above
   for [start+ins.offset, end] if they aren't equal.

3. Finally your case, start != orig_start, and btrfs_reserve_extent() failed.
   In both the unlock == 1 and unlock == 0 case we have to make sure that we get
   START_WRITEBACK | END_WRITEBACK for the [orig_start, start - 1] range,
   without DATA_RESV or META_RESV as that'll be handled by the ordered extent
   cleanup.  Then we need to do the above with the range [start, end].

I'd like to see the error handling cleaned up, but at the very least your fix is
incomplete.  If you tackle the cleanup that can be separate.  Thanks,

Josef
Naohiro Aota Dec. 14, 2021, 12:21 p.m. UTC | #3
On Mon, Dec 13, 2021 at 10:14:34AM +0000, Filipe Manana wrote:
> On Mon, Dec 13, 2021 at 12:43:38PM +0900, Naohiro Aota wrote:
> > [...]
> > @@ -1109,6 +1109,22 @@ static u64 get_extent_allocation_hint(struct btrfs_inode *inode, u64 start,
> >   * *page_started is set to one if we unlock locked_page and do everything
> >   * required to start IO on it.  It may be clean and already done with
> >   * IO when we return.
> > + *
> > + * When unlock == 1, we unlock the pages in successfully allocated regions. We
> > + * leave them locked otherwise for writing them out.
> > + *
> > + * However, we unlock all the pages except @locked_page in case of failure.
> > + *
> > + * In summary, page locking state will be as follow:
> > + *
> > + * - page_started == 1 (return value)
> > + *     - All the pages are unlocked. IO is started.
> > + *     - Note that this can happen only on success
> > + * - unlock == 0
> > + *     - All the pages except @locked_page are unlocked in any case
> 
> That should be the unlock == 1 case.
> 
> > + * - unlock == 1
> > + *     - On success, all the pages are locked for writing out them
> > + *     - On failure, all the pages except @locked_page are unlocked
> 
> And that should be the unlock == 0 case.
> 
> Otherwise it looks fine.

Oops, thank you for catching this. I must have screwed up while
rewriting from "unlock != 0" style (which is technically correct).

> [...]
Naohiro Aota Dec. 14, 2021, 1:04 p.m. UTC | #4
On Mon, Dec 13, 2021 at 11:08:07AM -0500, Josef Bacik wrote:
> On Mon, Dec 13, 2021 at 12:43:38PM +0900, Naohiro Aota wrote:
> > [...]
> > In summary, cow_file_range now behaves like this:
> > 
> > - page_started == 1 (return value)
> >   - All the pages are unlocked. IO is started.
> > - unlock == 0
> >   - All the pages except @locked_page are unlocked in any case
> > - unlock == 1
> >   - On success, all the pages are locked for writing out them
> >   - On failure, all the pages except @locked_page are unlocked

Sorry, I screwed up unlock == 0 vs unlock == 1, as Filipe already
pointed out.

> > [...]
> 
> I spent a good deal of time going through this code, and it needs some cleaning
> up that can be done separately from this patch.  However this patch isn't quite
> right either.
> 
> For the unlock == 1 case we'll just leave the ordered extent sitting around,
> never submitting it for IO.  So eventually we'll come around and do
> btrfs_wait_ordered_extent() and hang.

Isn't the ordered extent handled by btrfs_cleanup_ordered_extents() in
btrfs_run_delalloc_range() (for both unlock == 0 and unlock == 1)? Or
do you mean this should also be handled in cow_file_range()?

In the submit_uncompressed_range() case (unlock == 0), we don't call
btrfs_cleanup_ordered_extents(), so this part can be a problem.

> We need the extent_clear_unlock_delalloc() with START_WRITEBACK and
> END_WRITEBACK so the ordered extent cleanup for the one we created gets run, we
> just need to not do the PAGE_UNLOCK for that range.

I see. So, we need START_WRITEBACK and END_WRITEBACK both on unlock ==
0 and unlock == 1.

> Also you are doing CLEAR_META_RESV here, which isn't correct because that'll be
> handled at ordered extent cleanup time.  I'm sort of surprised you didn't get
> leak errors with this fix.

Ah, I missed this point. Maybe because it only happens under a specific
condition (e.g. with a fault-injected ENOSPC in the zoned code).

> There's like 3 different error conditions we want to handle here, all with
> subtly different error handling.
> 
> 1. We didn't do anything, we can just clean everything up.  We can just do
> 
>         clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |  
>                 EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV;
>         page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;   
>         extent_clear_unlock_delalloc(inode, start, end, locked_page,
>                                      clear_bits | EXTENT_CLEAR_DATA_RESV,                                  
>                                      page_ops);
> 
>   This is because we need to clear the META and DATA reservations, and we need
>   to unlock all pages (except lock_page) and be done.
> 
> 2. start == orig_start, but we couldn't create our ordered_extent.  The above
>    needs to be called on [start, start+ins.offset-1] without DATA_RESV because
>    that was handled by the btrfs_reserve_extent().  Then we need to do the above
>    for [start+ins.offset, end] if they aren't equal.
> 
> 3. Finally your case, start != orig_start, and btrfs_reserve_extent() failed.
>    In both the unlock == 1 and unlock == 0 case we have to make sure that we get
>    START_WRITEBACK | END_WRITEBACK for the [orig_start, start - 1] range,
>    without DATA_RESV or META_RESV as that'll be handled by the ordered extent
>    cleanup.  Then we need to do the above with the range [start, end].
> 
> I'd like to see the error handling cleaned up, but at the very least your fix is
> incomplete.  If you tackle the cleanup that can be separate.  Thanks,
> 
> Josef

Thank you for clarifying the cases. I'll try to clean it up.
Johannes Thumshirn May 19, 2022, 12:24 p.m. UTC | #5
What's the status of this patch? It fixes actual errors
(hung tasks) for me.

On 13/12/2021 04:43, Naohiro Aota wrote:
> There is a hung_task report regarding page lock on zoned btrfs like below.
> 
> https://github.com/naota/linux/issues/59
> 
> [  726.328648] INFO: task rocksdb:high0:11085 blocked for more than 241 seconds.
> [  726.329839]       Not tainted 5.16.0-rc1+ #1
> [  726.330484] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  726.331603] task:rocksdb:high0   state:D stack:    0 pid:11085 ppid: 11082 flags:0x00000000
> [  726.331608] Call Trace:
> [  726.331611]  <TASK>
> [  726.331614]  __schedule+0x2e5/0x9d0
> [  726.331622]  schedule+0x58/0xd0
> [  726.331626]  io_schedule+0x3f/0x70
> [  726.331629]  __folio_lock+0x125/0x200
> [  726.331634]  ? find_get_entries+0x1bc/0x240
> [  726.331638]  ? filemap_invalidate_unlock_two+0x40/0x40
> [  726.331642]  truncate_inode_pages_range+0x5b2/0x770
> [  726.331649]  truncate_inode_pages_final+0x44/0x50
> [  726.331653]  btrfs_evict_inode+0x67/0x480
> [  726.331658]  evict+0xd0/0x180
> [  726.331661]  iput+0x13f/0x200
> [  726.331664]  do_unlinkat+0x1c0/0x2b0
> [  726.331668]  __x64_sys_unlink+0x23/0x30
> [  726.331670]  do_syscall_64+0x3b/0xc0
> [  726.331674]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  726.331677] RIP: 0033:0x7fb9490a171b
> [  726.331681] RSP: 002b:00007fb943ffac68 EFLAGS: 00000246 ORIG_RAX: 0000000000000057
> [  726.331684] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb9490a171b
> [  726.331686] RDX: 00007fb943ffb040 RSI: 000055a6bbe6ec20 RDI: 00007fb94400d300
> [  726.331687] RBP: 00007fb943ffad00 R08: 0000000000000000 R09: 0000000000000000
> [  726.331688] R10: 0000000000000031 R11: 0000000000000246 R12: 00007fb943ffb000
> [  726.331690] R13: 00007fb943ffb040 R14: 0000000000000000 R15: 00007fb943ffd260
> [  726.331693]  </TASK>
> 
> While debugging the issue, we found that running fstests generic/551 on a
> 5GB non-zoned null_blk device in emulated zoned mode hit a similar hang.
> 
> The hang occurs when cow_file_range() fails in the middle of
> allocation. cow_file_range(), called from do_allocation_zoned(), can
> split the given region ([start, end]) for allocation depending on
> current block group usage. When btrfs can allocate bytes for one part
> of the split regions but fails for another region (e.g. because of
> -ENOSPC), we return the error, leaving the pages in the successfully
> allocated regions locked. Technically, this occurs only when @unlock == 0.
> Otherwise, we unlock the pages in an allocated region after creating an
> ordered extent.
> 
> Theoretically, the same issue can happen in
> submit_uncompressed_range(). However, I could not reproduce it even
> after modifying the code to always go through
> submit_uncompressed_range().
> 
> Considering that the callers of cow_file_range(unlock=0) won't write out
> the pages, we can unlock the pages on the error exit from
> cow_file_range(). That way, we can ensure all the pages except @locked_page
> are unlocked in the error case.
> 
> In summary, cow_file_range now behaves like this:
> 
> - page_started == 1 (return value)
>   - All the pages are unlocked. IO is started.
> - unlock == 0
>   - All the pages except @locked_page are unlocked in any case
> - unlock == 1
>   - On success, all the pages are kept locked so they can be written out
>   - On failure, all the pages except @locked_page are unlocked
>
Filipe Manana May 19, 2022, 1:38 p.m. UTC | #6
On Thu, May 19, 2022 at 12:24:00PM +0000, Johannes Thumshirn wrote:
> What's the status of this patch? It fixes actual errors 
> (hung_tasks) for me.

Well, there was a previous review of it, and nothing has been addressed in
the meantime.

It may happen to fix the hang for some fstests test case on zoned mode.
But there's a fundamental problem with the error handling, as Josef pointed
out before - it is an existing problem, not one introduced by this patch or
exclusive to zoned mode.

The problem is that if we fail on the second iteration of the while loop,
when reserving an extent for example, we leave the loop with an ordered
extent created in the previous iteration and return an error to the caller.
After that we end up never submitting a bio for that ordered extent's range,
which means we end up with an ordered extent that will never be completed.
So something like an fsync after that will hang forever, for example, as
will anything else calling btrfs_wait_ordered_range().

So on error we need to go through the previously created ordered extents,
set the IOERR flag on them, and complete them to wake up any waiters and
remove them, which also takes care of adding the reserved extent back to
the free space cache/tree.



> 
> On 13/12/2021 04:43, Naohiro Aota wrote:
> > > [ quoted commit message trimmed, see above ]
Johannes Thumshirn May 19, 2022, 1:51 p.m. UTC | #7
On 19/05/2022 15:39, Filipe Manana wrote:
> On Thu, May 19, 2022 at 12:24:00PM +0000, Johannes Thumshirn wrote:
>> What's the status of this patch? It fixes actual errors 
>> (hung_tasks) for me.
> Well, there was previous review about it, and nothing was addressed in the
> meanwhile.

The question was about the general status of it, not whether we're going to
merge it. I know about Josef's reply.
David Sterba May 19, 2022, 4:18 p.m. UTC | #8
On Thu, May 19, 2022 at 01:51:34PM +0000, Johannes Thumshirn wrote:
> On 19/05/2022 15:39, Filipe Manana wrote:
> > On Thu, May 19, 2022 at 12:24:00PM +0000, Johannes Thumshirn wrote:
> >> What's the status of this patch? It fixes actual errors 
> >> (hung_tasks) for me.
> > Well, there was previous review about it, and nothing was addressed in the
> > meanwhile.
> 
> The question was about the general status of it, not if we're going to merge
> it. I know Josef's reply.

If I see comments that do not suggest only simple fixups are needed, I wait
either for a discussion to resolve the questions or for an updated v2. We
want a fix for the reported hang, but the fix should be complete.
Naohiro Aota June 13, 2022, 12:21 p.m. UTC | #9
On Thu, May 19, 2022 at 02:38:50PM +0100, Filipe Manana wrote:
> On Thu, May 19, 2022 at 12:24:00PM +0000, Johannes Thumshirn wrote:
> > What's the status of this patch? It fixes actual errors 
> > (hung_tasks) for me.
> 
> Well, there was previous review about it, and nothing was addressed in the
> meanwhile.
> 
> It may happen to fix the hang for some test case for fstests on zoned mode.
> But there's a fundamental problem with the error handling as Josef pointed
> before - this is an existing problem, and not a problem with this patch or
> exclusive to zoned mode.
> 
> The problem is that if we fail on the second iteration of the while loop,
> when reserving an extent for example, we leave the loop with an ordered
> extent created in the previous iteration and return an error to the caller.
> After that we end up never submitting a bio for that ordered extent's range,
> which means we end up with an ordered extent that will never be completed.
> So something like an fsync after that will hang forever, for example, as
> will anything else calling btrfs_wait_ordered_range().

I have recently been revisiting this patch and I think this is not true.
The ordered extents instantiated in the previous loop iterations (before an
error from btrfs_reserve_extent) are then cleaned up by
btrfs_cleanup_ordered_extents(), called in btrfs_run_delalloc_range(), no?
So, btrfs_wait_ordered_range() never hangs on those ordered extents.

Well, there is a path not calling btrfs_cleanup_ordered_extents(), which is
submit_uncompressed_range(). So, this path needs to call it.

Or should we do the cleanup within cow_file_range()? It is more
understandable to clean up the extents where they are created in the error
case. But then, run_delalloc_nocow() should also clean up the ordered
extents created by itself (BTRFS_ORDERED_PREALLOC or BTRFS_ORDERED_NOCOW
extents).

> So on error we need to go through the previously created ordered extents,
> set the IOERR flag on them, and complete them to wake up any waiters and
> remove them, which also takes care of adding the reserved extent back to
> the free space cache/tree.

I think this is exactly done by btrfs_cleanup_ordered_extents() +
end_extent_writepage() in __extent_writepage(). No?

So, in the end, we just need to unlock the pages except locked_page in the
error case.

> 
> 
> > 
> > On 13/12/2021 04:43, Naohiro Aota wrote:
> > > [ quoted commit message trimmed, see above ]
Filipe Manana June 13, 2022, 3:33 p.m. UTC | #10
On Mon, Jun 13, 2022 at 12:21:11PM +0000, Naohiro Aota wrote:
> On Thu, May 19, 2022 at 02:38:50PM +0100, Filipe Manana wrote:
> > On Thu, May 19, 2022 at 12:24:00PM +0000, Johannes Thumshirn wrote:
> > > What's the status of this patch? It fixes actual errors 
> > > (hung_tasks) for me.
> > 
> > Well, there was previous review about it, and nothing was addressed in the
> > meanwhile.
> > 
> > It may happen to fix the hang for some test case for fstests on zoned mode.
> > But there's a fundamental problem with the error handling as Josef pointed
> > before - this is an existing problem, and not a problem with this patch or
> > exclusive to zoned mode.
> > 
> > The problem is that if we fail on the second iteration of the while loop,
> > when reserving an extent for example, we leave the loop with an ordered
> > extent created in the previous iteration and return an error to the caller.
> > After that we end up never submitting a bio for that ordered extent's range,
> > which means we end up with an ordered extent that will never be completed.
> > So something like an fsync after that will hang forever, for example, as
> > will anything else calling btrfs_wait_ordered_range().
> 
> I have recently been revisiting this patch and I think this is not true.
> The ordered extents instantiated in the previous loop iterations (before an
> error from btrfs_reserve_extent) are then cleaned up by
> btrfs_cleanup_ordered_extents(), called in btrfs_run_delalloc_range(), no?
> So, btrfs_wait_ordered_range() never hangs on those ordered extents.

You are right, we are doing the proper cleanup of ordered extents.
I missed that we have this function btrfs_cleanup_ordered_extents() that is
run for all delalloc cases. In the old days we had that only for the error
path of direct IO writes, but it was later refactored and added for the
delalloc cases too. I had those very old days in mind, sorry.

> 
> Well, there is a path not calling btrfs_cleanup_ordered_extents(), which is
> submit_uncompressed_range(). So, this path needs to call it.

Sounds right.

> 
> Or should we do the cleanup within cow_file_range()? It is more
> understandable to clean up the extents where they are created in the error
> case. But then, run_delalloc_nocow() should also clean up the ordered
> extents created by itself (BTRFS_ORDERED_PREALLOC or BTRFS_ORDERED_NOCOW
> extents).

I think it's fine as it is.
In case it confuses someone, we could just leave a comment at
cow_file_range(), etc., saying that we do the cleanup of previously created
ordered extents at btrfs_run_delalloc_range().


> 
> > So on error we need to go through the previously created ordered extents,
> > set the IOERR flag on them, and complete them to wake up any waiters and
> > remove them, which also takes care of adding the reserved extent back to
> > the free space cache/tree.
> 
> I think this is exactly done by btrfs_cleanup_ordered_extents() +
> end_extent_writepage() in __extent_writepage(). No?
> 
> So, in the end, we just need to unlock the pages except locked_page in the
> error case.

So I'm back to my initial review, where everything seems fine except for
the small mistake in the comment.

Thanks.

> 
> > 
> > 
> > > 
> > > On 13/12/2021 04:43, Naohiro Aota wrote:
> > > > [ quoted commit message trimmed, see above ]
Naohiro Aota June 14, 2022, 3 a.m. UTC | #11
On Mon, Jun 13, 2022 at 04:33:18PM +0100, Filipe Manana wrote:
> On Mon, Jun 13, 2022 at 12:21:11PM +0000, Naohiro Aota wrote:
> > On Thu, May 19, 2022 at 02:38:50PM +0100, Filipe Manana wrote:
> > > On Thu, May 19, 2022 at 12:24:00PM +0000, Johannes Thumshirn wrote:
> > > > What's the status of this patch? It fixes actual errors 
> > > > (hung_tasks) for me.
> > > 
> > > Well, there was previous review about it, and nothing was addressed in the
> > > meanwhile.
> > > 
> > > It may happen to fix the hang for some test case for fstests on zoned mode.
> > > But there's a fundamental problem with the error handling as Josef pointed
> > > before - this is an existing problem, and not a problem with this patch or
> > > exclusive to zoned mode.
> > > 
> > > The problem is that if we fail on the second iteration of the while loop,
> > > when reserving an extent for example, we leave the loop with an ordered
> > > extent created in the previous iteration and return an error to the caller.
> > > After that we end up never submitting a bio for that ordered extent's range,
> > > which means we end up with an ordered extent that will never be completed.
> > > So something like an fsync after that will hang forever, for example, as
> > > will anything else calling btrfs_wait_ordered_range().
> > 
> > I have recently been revisiting this patch and I think this is not true.
> > The ordered extents instantiated in the previous loop iterations (before an
> > error from btrfs_reserve_extent) are then cleaned up by
> > btrfs_cleanup_ordered_extents(), called in btrfs_run_delalloc_range(), no?
> > So, btrfs_wait_ordered_range() never hangs on those ordered extents.
> 
> You are right, we are doing the proper cleanup of ordered extents.
> I missed that we have this function btrfs_cleanup_ordered_extents() that is
> run for all delalloc cases. In the old days we had that only for the error
> path of direct IO writes, but it was later refactored and added for the
> delalloc cases too. I had those very old days in mind, sorry.
> 
> > 
> > Well, there is a path not calling btrfs_cleanup_ordered_extents(), which is
> > submit_uncompressed_range(). So, this path needs to call it.
> 
> Sounds right.
> 
> > 
> > Or should we do the cleanup within cow_file_range()? It is more
> > understandable to clean up the extents where they are created in the error
> > case. But then, run_delalloc_nocow() should also clean up the ordered
> > extents created by itself (BTRFS_ORDERED_PREALLOC or BTRFS_ORDERED_NOCOW
> > extents).
> 
> I think it's fine as it is.
> In case it confuses someone, we could just leave a comment at
> cow_file_range(), etc., saying that we do the cleanup of previously created
> ordered extents at btrfs_run_delalloc_range().
> 
> 
> > 
> > > So on error we need to go through the previously created ordered extents,
> > > set the IOERR flag on them, and complete them to wake up any waiters and
> > > remove them, which also takes care of adding the reserved extent back to
> > > the free space cache/tree.
> > 
> > I think this is exactly done by btrfs_cleanup_ordered_extents() +
> > end_extent_writepage() in __extent_writepage(). No?
> > 
> > So, in the end, we just need to unlock the pages except locked_page in the
> > error case.
> 
> So I'm back to my initial review, where everything seems fine except for
> the small mistake in the comment.

Thank you for confirming. I'll send a new version with the comment fixed
and extended, and with fixes for the extent_clear_unlock_delalloc() arguments.

> 
> Thanks.
> 
> > 
> > > 
> > > 
> > > > 
> > > > On 13/12/2021 04:43, Naohiro Aota wrote:
> > > > > [ quoted commit message trimmed, see above ]
diff mbox series

Patch

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b8c911a4a320..e21c94bbc56c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1109,6 +1109,22 @@  static u64 get_extent_allocation_hint(struct btrfs_inode *inode, u64 start,
  * *page_started is set to one if we unlock locked_page and do everything
  * required to start IO on it.  It may be clean and already done with
  * IO when we return.
+ *
+ * When unlock == 1, we unlock the pages in successfully allocated regions.
+ * Otherwise, we leave them locked so they can be written out.
+ *
+ * However, we unlock all the pages except @locked_page in case of failure.
+ *
+ * In summary, the page locking state will be as follows:
+ *
+ * - page_started == 1 (return value)
+ *     - All the pages are unlocked. IO is started.
+ *     - Note that this can happen only on success
+ * - unlock == 0
+ *     - All the pages except @locked_page are unlocked in any case
+ * - unlock == 1
+ *     - On success, all the pages are kept locked so they can be written out
+ *     - On failure, all the pages except @locked_page are unlocked
  */
 static noinline int cow_file_range(struct btrfs_inode *inode,
 				   struct page *locked_page,
@@ -1118,6 +1134,7 @@  static noinline int cow_file_range(struct btrfs_inode *inode,
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	u64 alloc_hint = 0;
+	u64 orig_start = start;
 	u64 num_bytes;
 	unsigned long ram_size;
 	u64 cur_alloc_size = 0;
@@ -1305,6 +1322,10 @@  static noinline int cow_file_range(struct btrfs_inode *inode,
 	clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
 		EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV;
 	page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
+	/* Ensure we unlock all the pages except @locked_page in the error case */
+	if (!unlock && orig_start < start)
+		extent_clear_unlock_delalloc(inode, orig_start, start - 1,
+					     locked_page, clear_bits, page_ops);
 	/*
 	 * If we reserved an extent for our delalloc range (or a subrange) and
 	 * failed to create the respective ordered extent, then it means that