[v5] btrfs: Handle delalloc error correctly to avoid ordered extent hang

[BUG]
There are reports of btrfs hanging when running btrfs/124 with the
default mount options and btrfs/125 with the nospace_cache or
space_cache=v2 mount options, with the following backtrace:

Call Trace:
 __schedule+0x2d4/0xae0
 schedule+0x3d/0x90
 btrfs_start_ordered_extent+0x160/0x200 [btrfs]
 ? wake_atomic_t_function+0x60/0x60
 btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
 btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
 btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
 process_one_work+0x2af/0x720
 ? process_one_work+0x22b/0x720
 worker_thread+0x4b/0x4f0
 kthread+0x10f/0x150
 ? process_one_work+0x720/0x720
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x2e/0x40

[CAUSE]
The problem is that the error handler in run_delalloc_nocow() doesn't
handle errors from btrfs_reloc_clone_csums() correctly.

The error handlers in run_delalloc_nocow() and cow_file_range() clear
the dirty flags and finish writeback for the remaining pages, as follows:

|<------------------ delalloc range --------------------------->|
| Ordered extent 1 | Ordered extent 2          |
|    Submitted OK  | reloc_clone_csums() error |
|<>|               |<----------- cleanup range ---------------->|
 ||
 \_=> First page handled by end_extent_writepage() in __extent_writepage()

This behavior has two problems:
1) Ordered extent 2 will never finish
   Ordered extent 2 is already submitted, and it relies on endio hooks
   to wait for all its pages to finish.

   However, since we finish writeback in the error handler without going
   through those hooks, ordered extent 2 will never finish.

2) Metadata underflow
   btrfs_finish_ordered_io() for an ordered extent frees its reserved
   metadata space, while the error handlers also free the metadata space
   of the remaining range, which covers ordered extent 2.

   So even if problem 1) is solved, we can still underflow the metadata
   reservation, which leads to a fatal btrfs assertion failure.
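
Problem 2) can be illustrated with a small userspace model. This is only
a sketch under stated assumptions: every toy_* name is hypothetical and
does not exist in btrfs, and the reservation is reduced to one byte
counter that is released first by the error handler and then by ordered
extent completion.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names exist in btrfs. */
struct toy_reservation {
	int64_t reserved;	/* bytes of metadata currently reserved */
};

/* Release bytes; return 0 on success, -1 if the counter would underflow. */
int toy_release(struct toy_reservation *r, int64_t bytes)
{
	if (bytes > r->reserved)
		return -1;	/* the kernel would trip an assertion here */
	r->reserved -= bytes;
	return 0;
}

/*
 * oe2_bytes:    space reserved for the already submitted ordered extent 2
 * rest_bytes:   space reserved for the range after ordered extent 2
 * skip_ordered: 0 models the old error handler (frees both ranges),
 *               non-zero models an error handler that skips the range
 *               covered by the ordered extent
 *
 * Returns 0 if all releases balance, -1 on underflow.
 */
int toy_error_path(int64_t oe2_bytes, int64_t rest_bytes, int skip_ordered)
{
	struct toy_reservation r = { .reserved = oe2_bytes + rest_bytes };
	int64_t handler_bytes = skip_ordered ? rest_bytes
					     : oe2_bytes + rest_bytes;

	if (toy_release(&r, handler_bytes))	/* error handler */
		return -1;
	if (toy_release(&r, oe2_bytes))		/* ordered extent completion */
		return -1;
	assert(r.reserved == 0);
	return 0;
}
```

In this model toy_error_path(4096, 8192, 0) reports an underflow, because
the space of ordered extent 2 is released twice, while
toy_error_path(4096, 8192, 1) balances.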

[FIX]
This patch resolves the problem in two steps:
1) Introduce btrfs_cleanup_ordered_extents() to clean up ordered extents
   Slightly modify one existing function,
   btrfs_endio_direct_write_update_ordered(), to handle the free space
   inode just like btrfs_writepage_endio_hook() does, and to skip the
   first page to cooperate with end_extent_writepage().

   So btrfs_cleanup_ordered_extents() will search for all submitted
   ordered extents in the specified range and clean them up, except for
   the first page.

2) Make error handlers skip any range covered by an ordered extent
   For run_delalloc_nocow() and cow_file_range(), only allow the error
   handlers to clean up pages/extents not covered by a submitted ordered
   extent.

   The compression path calls writepage_end_io_hook() itself to handle
   its errors, and any submitted ordered extent will have its bio
   submitted, so there is no need to worry about the compression path.

After the fix, the cleanup will happen as follows:

|<--------------------------- delalloc range --------------------------->|
| Ordered extent 1 | Ordered extent 2          |
|    Submitted OK  | reloc_clone_csums() error |
|<>|<- Cleaned up by cleanup_ordered_extents ->|<-- old error handler--->|
 ||
 \_=> First page handled by end_extent_writepage() in __extent_writepage()
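
The split shown in the diagram can be sketched as plain range arithmetic.
This is a simplified userspace model, not the code in this patch: the
toy_* names and TOY_PAGE_SIZE are hypothetical, offsets are assumed to be
page aligned, and the first page is assumed to sit at the start of the
delalloc range.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096ULL	/* assumed page size, for illustration only */

/* Hypothetical types; none of these exist in btrfs. */
struct toy_range {
	uint64_t start;
	uint64_t len;
};

struct toy_cleanup_plan {
	struct toy_range first_page;	/* left to end_extent_writepage() */
	struct toy_range ordered;	/* submitted ordered extents */
	struct toy_range remaining;	/* left to the old error handler */
};

/*
 * start:         start of the delalloc range
 * submitted_end: end (exclusive) of the last submitted ordered extent
 * end:           end (exclusive) of the delalloc range
 */
struct toy_cleanup_plan toy_plan_cleanup(uint64_t start,
					 uint64_t submitted_end,
					 uint64_t end)
{
	struct toy_cleanup_plan p;

	/* At least the first page must be covered by an ordered extent. */
	assert(start + TOY_PAGE_SIZE <= submitted_end);
	assert(submitted_end <= end);

	p.first_page = (struct toy_range){ start, TOY_PAGE_SIZE };
	/* Skip the first page so it is not accounted for twice. */
	p.ordered = (struct toy_range){ start + TOY_PAGE_SIZE,
					submitted_end - start - TOY_PAGE_SIZE };
	/* Only the range not covered by any ordered extent is left over. */
	p.remaining = (struct toy_range){ submitted_end, end - submitted_end };
	return p;
}
```

For a 16 page delalloc range whose first 8 pages are covered by submitted
ordered extents, the model assigns 1 page to the first page, 7 pages to
the ordered extent cleanup, and the last 8 pages to the old error handler.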

Suggested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
v2:
  Add BTRFS_ORDERED_SKIP_METADATA flag to avoid reducing outstanding
  extents twice, which is already done by
  extent_clear_unlock_delalloc() with the EXTENT_DO_ACCOUNT control bit.
v3:
  Skip the first page to avoid underflowing ordered->bytes_left.
  Fix the range passed in cow_file_range(), which didn't cover the whole
  extent.
  Expand the extent_clear_unlock_delalloc() range to allow it to handle
  metadata release.
v4:
  Don't use extra bit to skip metadata freeing for ordered extent,
  but only handle btrfs_reloc_clone_csums() error just before processing
  next extent.
  This makes error handling much easier for run_delalloc_nocow().
v5:
  Various grammar and comment fixes suggested by Filipe Manana.
  Enhance the commit message to focus on the generic error handler bug,
  pointed out by Filipe Manana.
  Add the missing wq/func user in __endio_write_update_ordered(), pointed
  out by Filipe Manana.
  Enhance the commit message with ASCII art to explain the bug more easily.
  Fix a bug which can lead to corrupted extent allocation, exposed by
  Filipe Manana.
---
 fs/btrfs/inode.c | 116 ++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 97 insertions(+), 19 deletions(-)
