xfs: ensure submit buffers on LSN boundaries in error handlers

Message ID 20231228124646.142757-1-leo.lilong@huawei.com (mailing list archive)
State Superseded, archived
Series xfs: ensure submit buffers on LSN boundaries in error handlers

Commit Message

Long Li Dec. 28, 2023, 12:46 p.m. UTC
While performing IO fault injection tests, I caught the following data
corruption report:

 XFS (dm-0): Internal error ltbno + ltlen > bno at line 1957 of file fs/xfs/libxfs/xfs_alloc.c.  Caller xfs_free_ag_extent+0x79c/0x1130
 CPU: 3 PID: 33 Comm: kworker/3:0 Not tainted 6.5.0-rc7-next-20230825-00001-g7f8666926889 #214
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
 Workqueue: xfs-inodegc/dm-0 xfs_inodegc_worker
 Call Trace:
  <TASK>
  dump_stack_lvl+0x50/0x70
  xfs_corruption_error+0x134/0x150
  xfs_free_ag_extent+0x7d3/0x1130
  __xfs_free_extent+0x201/0x3c0
  xfs_trans_free_extent+0x29b/0xa10
  xfs_extent_free_finish_item+0x2a/0xb0
  xfs_defer_finish_noroll+0x8d1/0x1b40
  xfs_defer_finish+0x21/0x200
  xfs_itruncate_extents_flags+0x1cb/0x650
  xfs_free_eofblocks+0x18f/0x250
  xfs_inactive+0x485/0x570
  xfs_inodegc_worker+0x207/0x530
  process_scheduled_works+0x24a/0xe10
  worker_thread+0x5ac/0xc60
  kthread+0x2cd/0x3c0
  ret_from_fork+0x4a/0x80
  ret_from_fork_asm+0x11/0x20
  </TASK>
 XFS (dm-0): Corruption detected. Unmount and run xfs_repair

Analysis of the disk image showed that the corruption was triggered by
the same extent being recorded in both the inode and the AGF free space
btrees. After a long period of reproduction and analysis, we found that
the root cause of the problem was that an AGF btree block was not
recovered during log recovery.

Consider the following situation: Transaction A and Transaction B are in
the same log record, so Transaction A and Transaction B share the same
LSN1. If the buf item in Transaction A has already been recovered (and
its buffer written back), the buf item in Transaction B cannot be
recovered, because log recovery skips items with a metadata LSN >= the
current LSN of the recovery item. If Transaction B also contains an
inode item that records Extent X, then Extent X ends up recorded in both
the inode and the AGF btree block after Transaction B is recovered.

  |------------Record (LSN1)------------------|---Record (LSN2)---|
  |----------Trans A------------|-------------Trans B-------------|
  |     Buf Item(Extent X)      | Buf Item / Inode item(Extent X) |
  |     Extent X is freed       |     Extent X is allocated       |
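
To make the failure mode concrete, here is a purely illustrative
user-space model of the skip rule (the struct, the replay() helper and
the LSN values are invented, not XFS code). It shows how, once the AGF
buffer has been written back stamped with LSN1 by a partial recovery
attempt, replaying the record at LSN1 again applies the inode change but
skips the AGF change, leaving Extent X owned twice:

  #include <stdio.h>
  #include <stdbool.h>
  #include <stdint.h>

  /*
   * Illustrative user-space model only, not XFS code: replay an item
   * only when the on-disk metadata LSN is older than the LSN being
   * recovered.
   */
  #define LSN1 100

  struct meta {
          uint64_t lsn;           /* LSN stamped on the on-disk block */
          bool     has_extent_x;  /* does this structure own Extent X? */
  };

  static void replay(struct meta *m, uint64_t item_lsn, bool x_after)
  {
          if (m->lsn >= item_lsn) /* metadata already at/past this LSN */
                  return;         /* -> skip the item */
          m->has_extent_x = x_after;
          m->lsn = item_lsn;
  }

  int main(void)
  {
          /*
           * State left by the failed first mount: the AGF buffer from
           * Trans A was written back stamped with LSN1 (Extent X free),
           * while the inode buffer was never written.
           */
          struct meta agf   = { .lsn = LSN1, .has_extent_x = true  };
          struct meta inode = { .lsn = 0,    .has_extent_x = false };

          /* Second mount replays the record at LSN1 again. */
          replay(&inode, LSN1, true);   /* inode item, Trans B: applied */
          replay(&agf,   LSN1, false);  /* AGF buf item, Trans B: skipped */

          /* Prints 1/1: Extent X is recorded in both places. */
          printf("in inode: %d, in AGF free space: %d\n",
                 inode.has_extent_x, agf.has_extent_x);
          return 0;
  }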

Since commit 12818d24db8a ("xfs: rework log recovery to submit buffers on
LSN boundaries"), buffers are submitted on LSN boundaries during log
recovery. That avoids the above problem on the normal path, but it is not
guaranteed on error paths. Consider the following process: if an error is
encountered after the buf item in Transaction A has been recovered but
before the buf item in Transaction B is recovered, the buffers that have
already been added to buffer_list are still submitted, which violates the
rule of submitting on LSN boundaries. The buf item in Transaction B then
cannot be recovered on the next mount, because the current LSN of the
transaction equals the metadata LSN on disk.

  xlog_do_recovery_pass
    error = xlog_recover_process
      xlog_recover_process_data
        ...
          xlog_recover_buf_commit_pass2
            xlog_recover_do_reg_buffer  //recover buf item in Trans A
            xfs_buf_delwri_queue(bp, buffer_list)
        ...
        ====> Encountered error and returned
        ...
          xlog_recover_buf_commit_pass2
            xlog_recover_do_reg_buffer  //recover buf item in Trans B
            xfs_buf_delwri_queue(bp, buffer_list)
    if (!list_empty(&buffer_list))
      xfs_buf_delwri_submit(&buffer_list); //submit regardless of error
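
As a schematic sketch of the ordering rule at stake (again purely
illustrative user-space code, not the kernel code paths; the record
contents below are invented): accumulate queued buffers per LSN, stop at
the first error, and only flush a batch once everything at that LSN has
replayed cleanly, which is what the change below enforces at the end of
the pass:

  #include <stdio.h>

  struct rec {
          long lsn;
          const char *item;
          int fails;      /* simulate an error while replaying this item */
  };

  int main(void)
  {
          struct rec log[] = {
                  { 100, "AGF buf item (Trans A)", 0 },
                  { 100, "some other item",        1 },  /* error hits here */
                  { 100, "AGF buf item (Trans B)", 0 },  /* never reached */
                  { 200, "next checkpoint",        0 },
          };
          long cur_lsn = -1;
          int queued = 0, error = 0;

          for (unsigned int i = 0; i < sizeof(log) / sizeof(log[0]); i++) {
                  if (log[i].lsn != cur_lsn) {
                          /* LSN boundary: flush everything queued so far. */
                          if (queued)
                                  printf("submit %d buffers at lsn %ld\n",
                                         queued, cur_lsn);
                          queued = 0;
                          cur_lsn = log[i].lsn;
                  }
                  if (log[i].fails) {
                          error = 1;
                          break;  /* abort recovery on the first error */
                  }
                  queued++;       /* stands in for xfs_buf_delwri_queue() */
          }

          /*
           * The change proposed here: never submit a partial batch (the
           * current code submits it regardless of error).
           */
          if (queued && !error)
                  printf("submit %d buffers at lsn %ld\n", queued, cur_lsn);
          else if (queued)
                  printf("cancel %d buffers at lsn %ld\n", queued, cur_lsn);
          return 0;
  }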

To make sure buffers are submitted on LSN boundaries even on error paths,
check the error status before submitting the buffers that were added from
the last record processed. If an error has occurred, cancel the buffers
on the buffer_list instead.

Canceling the buffers on the buffer_list directly isn't enough, though.
Unlike other places where a delwri list is canceled, these buffers have
had buf items attached by xfs_buf_item_init() during recovery and are
held by those buf items, and the buf items are not released by
xfs_buf_delwri_cancel(). If the buffers are submitted successfully, the
buf items associated with them are released at IO completion, so the
cancel path needs to release the buf items as well.

Fixes: 50d5c8d8e938 ("xfs: check LSN ordering for v5 superblocks during recovery")
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
 fs/xfs/xfs_buf.c         |  2 ++
 fs/xfs/xfs_log_recover.c | 22 +++++++++++++---------
 2 files changed, 15 insertions(+), 9 deletions(-)

Comments

Darrick J. Wong Jan. 4, 2024, 2:01 a.m. UTC | #1
On Thu, Dec 28, 2023 at 08:46:46PM +0800, Long Li wrote:
> While performing the IO fault injection test, I caught the following data
> corruption report:
> 
>  XFS (dm-0): Internal error ltbno + ltlen > bno at line 1957 of file fs/xfs/libxfs/xfs_alloc.c.  Caller xfs_free_ag_extent+0x79c/0x1130
>  CPU: 3 PID: 33 Comm: kworker/3:0 Not tainted 6.5.0-rc7-next-20230825-00001-g7f8666926889 #214
>  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
>  Workqueue: xfs-inodegc/dm-0 xfs_inodegc_worker
>  Call Trace:
>   <TASK>
>   dump_stack_lvl+0x50/0x70
>   xfs_corruption_error+0x134/0x150
>   xfs_free_ag_extent+0x7d3/0x1130
>   __xfs_free_extent+0x201/0x3c0
>   xfs_trans_free_extent+0x29b/0xa10
>   xfs_extent_free_finish_item+0x2a/0xb0
>   xfs_defer_finish_noroll+0x8d1/0x1b40
>   xfs_defer_finish+0x21/0x200
>   xfs_itruncate_extents_flags+0x1cb/0x650
>   xfs_free_eofblocks+0x18f/0x250
>   xfs_inactive+0x485/0x570
>   xfs_inodegc_worker+0x207/0x530
>   process_scheduled_works+0x24a/0xe10
>   worker_thread+0x5ac/0xc60
>   kthread+0x2cd/0x3c0
>   ret_from_fork+0x4a/0x80
>   ret_from_fork_asm+0x11/0x20
>   </TASK>
>  XFS (dm-0): Corruption detected. Unmount and run xfs_repair
> 
> After analyzing the disk image, it was found that the corruption was
> triggered by the fact that extent was recorded in both the inode and AGF
> btrees. After a long time of reproduction and analysis, we found that the
> root cause of the problem was that the AGF btree block was not recovered.
> 
> Consider the following situation, Transaction A and Transaction B are in
> the same record, so Transaction A and Transaction B share the same LSN1.
> If the buf item in Transaction A has been recovered, then the buf item in
> Transaction B cannot be recovered, because log recovery skips items with a
> metadata LSN >= the current LSN of the recovery item. If there is still an
> inode item in transaction B that records the Extent X, the Extent X will
> be recorded in both the inode and the AGF btree block after transaction B
> is recovered.
> 
>   |------------Record (LSN1)------------------|---Record (LSN2)---|
>   |----------Trans A------------|-------------Trans B-------------|
>   |     Buf Item(Extent X)      | Buf Item / Inode item(Extent X) |
>   |     Extent X is freed       |     Extent X is allocated       |
> 
> After commit 12818d24db8a ("xfs: rework log recovery to submit buffers on
> LSN boundaries") was introduced, we submit buffers on lsn boundaries during
> log recovery. The above problem can be avoided under normal paths, but it's
> not guaranteed under abnormal paths. Consider the following process, if an
> error was encountered after recover buf item in transaction A and before
> recover buf item in transaction B, buffers that have been added to
> buffer_list will still be submitted, this violates the submits rule on lsn
> boundaries. So buf item in Transaction B cannot be recovered on the next
> mount due to current lsn of transaction equal to metadata lsn on disk.
> 
>   xlog_do_recovery_pass
>     error = xlog_recover_process
>       xlog_recover_process_data
>         ...
>           xlog_recover_buf_commit_pass2
>             xlog_recover_do_reg_buffer  //recover buf item in Trans A
>             xfs_buf_delwri_queue(bp, buffer_list)
>         ...
>         ====> Encountered error and returned
>         ...
>           xlog_recover_buf_commit_pass2
>             xlog_recover_do_reg_buffer  //recover buf item in Trans B
>             xfs_buf_delwri_queue(bp, buffer_list)
>     if (!list_empty(&buffer_list))
>       xfs_buf_delwri_submit(&buffer_list); //submit regardless of error
> 
> In order to make sure that submits buffers on lsn boundaries in the
> abnormal paths, we need to check error status before submit buffers that
> have been added from the last record processed. If error status exist,
> buffers in the bufffer_list should be canceled.

What was the error, specifically?  I would have thought that recovery
would abort after "Encountered error and returned".  Does the recovery
somehow keep running and then finds the buf item in Trans B?

Or is the problem here that after the error, xfs submits the delwri
buffers?  And then the user tried to recover a second time, only this
time the recovery attempt reads Trans B, but then doesn't actually write
anything because the ondisk buffer now has the same LSN as Trans B?

<confused>

--D

> Canceling the buffers in the buffer_list directly isn't correct, unlike
> any other place where write list was canceled, these buffers has been
> initialized by xfs_buf_item_init() during recovery and held by buf
> item, buf items will not be released in xfs_buf_delwri_cancel(). If
> these buffers are submitted successfully, buf items assocated with
> the buffer will be released in io end process. So releasing buf item
> in write list cacneling process is needed.
> 
> Fixes: 50d5c8d8e938 ("xfs: check LSN ordering for v5 superblocks during recovery")
> Signed-off-by: Long Li <leo.lilong@huawei.com>
> ---
>  fs/xfs/xfs_buf.c         |  2 ++
>  fs/xfs/xfs_log_recover.c | 22 +++++++++++++---------
>  2 files changed, 15 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 8e5bd50d29fe..6a1b26aaf97e 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -2075,6 +2075,8 @@ xfs_buf_delwri_cancel(
>  		xfs_buf_lock(bp);
>  		bp->b_flags &= ~_XBF_DELWRI_Q;
>  		xfs_buf_list_del(bp);
> +		if (bp->b_log_item)
> +			xfs_buf_item_relse(bp);
>  		xfs_buf_relse(bp);
>  	}
>  }
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 1251c81e55f9..2cda6c90890d 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -2964,7 +2964,6 @@ xlog_do_recovery_pass(
>  	char			*offset;
>  	char			*hbp, *dbp;
>  	int			error = 0, h_size, h_len;
> -	int			error2 = 0;
>  	int			bblks, split_bblks;
>  	int			hblks, split_hblks, wrapped_hblks;
>  	int			i;
> @@ -3203,16 +3202,21 @@ xlog_do_recovery_pass(
>   bread_err1:
>  	kmem_free(hbp);
>  
> -	/*
> -	 * Submit buffers that have been added from the last record processed,
> -	 * regardless of error status.
> -	 */
> -	if (!list_empty(&buffer_list))
> -		error2 = xfs_buf_delwri_submit(&buffer_list);
> -
>  	if (error && first_bad)
>  		*first_bad = rhead_blk;
>  
> +	/*
> +	 * If there are no error, submit buffers that have been added from the
> +	 * last record processed, othrewise cancel the write list, to ensure
> +	 * submit buffers on LSN boundaries.
> +	 */
> +	if (!list_empty(&buffer_list)) {
> +		if (error)
> +			xfs_buf_delwri_cancel(&buffer_list);
> +		else
> +			error = xfs_buf_delwri_submit(&buffer_list);
> +	}
> +
>  	/*
>  	 * Transactions are freed at commit time but transactions without commit
>  	 * records on disk are never committed. Free any that may be left in the
> @@ -3226,7 +3230,7 @@ xlog_do_recovery_pass(
>  			xlog_recover_free_trans(trans);
>  	}
>  
> -	return error ? error : error2;
> +	return error;
>  }
>  
>  /*
> -- 
> 2.31.1
> 
>
Long Li Jan. 4, 2024, 1:03 p.m. UTC | #2
On Wed, Jan 03, 2024 at 06:01:21PM -0800, Darrick J. Wong wrote:
> On Thu, Dec 28, 2023 at 08:46:46PM +0800, Long Li wrote:
> > While performing the IO fault injection test, I caught the following data
> > corruption report:
> > 
> >  XFS (dm-0): Internal error ltbno + ltlen > bno at line 1957 of file fs/xfs/libxfs/xfs_alloc.c.  Caller xfs_free_ag_extent+0x79c/0x1130
> >  CPU: 3 PID: 33 Comm: kworker/3:0 Not tainted 6.5.0-rc7-next-20230825-00001-g7f8666926889 #214
> >  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
> >  Workqueue: xfs-inodegc/dm-0 xfs_inodegc_worker
> >  Call Trace:
> >   <TASK>
> >   dump_stack_lvl+0x50/0x70
> >   xfs_corruption_error+0x134/0x150
> >   xfs_free_ag_extent+0x7d3/0x1130
> >   __xfs_free_extent+0x201/0x3c0
> >   xfs_trans_free_extent+0x29b/0xa10
> >   xfs_extent_free_finish_item+0x2a/0xb0
> >   xfs_defer_finish_noroll+0x8d1/0x1b40
> >   xfs_defer_finish+0x21/0x200
> >   xfs_itruncate_extents_flags+0x1cb/0x650
> >   xfs_free_eofblocks+0x18f/0x250
> >   xfs_inactive+0x485/0x570
> >   xfs_inodegc_worker+0x207/0x530
> >   process_scheduled_works+0x24a/0xe10
> >   worker_thread+0x5ac/0xc60
> >   kthread+0x2cd/0x3c0
> >   ret_from_fork+0x4a/0x80
> >   ret_from_fork_asm+0x11/0x20
> >   </TASK>
> >  XFS (dm-0): Corruption detected. Unmount and run xfs_repair
> > 
> > After analyzing the disk image, it was found that the corruption was
> > triggered by the fact that extent was recorded in both the inode and AGF
> > btrees. After a long time of reproduction and analysis, we found that the
> > root cause of the problem was that the AGF btree block was not recovered.
> > 
> > Consider the following situation, Transaction A and Transaction B are in
> > the same record, so Transaction A and Transaction B share the same LSN1.
> > If the buf item in Transaction A has been recovered, then the buf item in
> > Transaction B cannot be recovered, because log recovery skips items with a
> > metadata LSN >= the current LSN of the recovery item. If there is still an
> > inode item in transaction B that records the Extent X, the Extent X will
> > be recorded in both the inode and the AGF btree block after transaction B
> > is recovered.
> > 
> >   |------------Record (LSN1)------------------|---Record (LSN2)---|
> >   |----------Trans A------------|-------------Trans B-------------|
> >   |     Buf Item(Extent X)      | Buf Item / Inode item(Extent X) |
> >   |     Extent X is freed       |     Extent X is allocated       |
> > 
> > After commit 12818d24db8a ("xfs: rework log recovery to submit buffers on
> > LSN boundaries") was introduced, we submit buffers on lsn boundaries during
> > log recovery. The above problem can be avoided under normal paths, but it's
> > not guaranteed under abnormal paths. Consider the following process, if an
> > error was encountered after recover buf item in transaction A and before
> > recover buf item in transaction B, buffers that have been added to
> > buffer_list will still be submitted, this violates the submits rule on lsn
> > boundaries. So buf item in Transaction B cannot be recovered on the next
> > mount due to current lsn of transaction equal to metadata lsn on disk.
> > 
> >   xlog_do_recovery_pass
> >     error = xlog_recover_process
> >       xlog_recover_process_data
> >         ...
> >           xlog_recover_buf_commit_pass2
> >             xlog_recover_do_reg_buffer  //recover buf item in Trans A
> >             xfs_buf_delwri_queue(bp, buffer_list)
> >         ...
> >         ====> Encountered error and returned
> >         ...
> >           xlog_recover_buf_commit_pass2
> >             xlog_recover_do_reg_buffer  //recover buf item in Trans B
> >             xfs_buf_delwri_queue(bp, buffer_list)
> >     if (!list_empty(&buffer_list))
> >       xfs_buf_delwri_submit(&buffer_list); //submit regardless of error
> > 
> > In order to make sure that submits buffers on lsn boundaries in the
> > abnormal paths, we need to check error status before submit buffers that
> > have been added from the last record processed. If error status exist,
> > buffers in the bufffer_list should be canceled.
> 
> What was the error, specifically?  I would have though that recovery
> would abort after "Encountered error and returned".  Does the recovery
> somehow keep running and then finds the buf item in Trans B?
> 

That was not what I meant. I'm just trying to point out that any error
that occurs after recovering the buf item in Transaction A and before
recovering the buf item in Transaction B can trigger the problem. It
doesn't matter what the error is; for example, a buf read error that
occurs during this window.

> Or is the problem here that after the error, xfs submits the delwri
> buffers?  And then the user tried to recover a second time, only this
> time the recovery attempt reads Trans B, but then doesn't actually write
> anything because the ondisk buffer now has the same LSN as Trans B?
> 

Yes, that's what I wanted to say.  I think I should change the description
of the process that triggers the issue in the commit message to avoid
misunderstandings.

Thanks
Long Li

> <confused>
> 
> --D
> 
> > Canceling the buffers in the buffer_list directly isn't correct, unlike
> > any other place where write list was canceled, these buffers has been
> > initialized by xfs_buf_item_init() during recovery and held by buf
> > item, buf items will not be released in xfs_buf_delwri_cancel(). If
> > these buffers are submitted successfully, buf items assocated with
> > the buffer will be released in io end process. So releasing buf item
> > in write list cacneling process is needed.
> > 
> > Fixes: 50d5c8d8e938 ("xfs: check LSN ordering for v5 superblocks during recovery")
> > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > ---
> >  fs/xfs/xfs_buf.c         |  2 ++
> >  fs/xfs/xfs_log_recover.c | 22 +++++++++++++---------
> >  2 files changed, 15 insertions(+), 9 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index 8e5bd50d29fe..6a1b26aaf97e 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -2075,6 +2075,8 @@ xfs_buf_delwri_cancel(
> >  		xfs_buf_lock(bp);
> >  		bp->b_flags &= ~_XBF_DELWRI_Q;
> >  		xfs_buf_list_del(bp);
> > +		if (bp->b_log_item)
> > +			xfs_buf_item_relse(bp);
> >  		xfs_buf_relse(bp);
> >  	}
> >  }
> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index 1251c81e55f9..2cda6c90890d 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -2964,7 +2964,6 @@ xlog_do_recovery_pass(
> >  	char			*offset;
> >  	char			*hbp, *dbp;
> >  	int			error = 0, h_size, h_len;
> > -	int			error2 = 0;
> >  	int			bblks, split_bblks;
> >  	int			hblks, split_hblks, wrapped_hblks;
> >  	int			i;
> > @@ -3203,16 +3202,21 @@ xlog_do_recovery_pass(
> >   bread_err1:
> >  	kmem_free(hbp);
> >  
> > -	/*
> > -	 * Submit buffers that have been added from the last record processed,
> > -	 * regardless of error status.
> > -	 */
> > -	if (!list_empty(&buffer_list))
> > -		error2 = xfs_buf_delwri_submit(&buffer_list);
> > -
> >  	if (error && first_bad)
> >  		*first_bad = rhead_blk;
> >  
> > +	/*
> > +	 * If there are no error, submit buffers that have been added from the
> > +	 * last record processed, othrewise cancel the write list, to ensure
> > +	 * submit buffers on LSN boundaries.
> > +	 */
> > +	if (!list_empty(&buffer_list)) {
> > +		if (error)
> > +			xfs_buf_delwri_cancel(&buffer_list);
> > +		else
> > +			error = xfs_buf_delwri_submit(&buffer_list);
> > +	}
> > +
> >  	/*
> >  	 * Transactions are freed at commit time but transactions without commit
> >  	 * records on disk are never committed. Free any that may be left in the
> > @@ -3226,7 +3230,7 @@ xlog_do_recovery_pass(
> >  			xlog_recover_free_trans(trans);
> >  	}
> >  
> > -	return error ? error : error2;
> > +	return error;
> >  }
> >  
> >  /*
> > -- 
> > 2.31.1
> > 
> > 
>
Dave Chinner Jan. 7, 2024, 10:13 p.m. UTC | #3
On Thu, Dec 28, 2023 at 08:46:46PM +0800, Long Li wrote:
> While performing the IO fault injection test, I caught the following data
> corruption report:
> 
>  XFS (dm-0): Internal error ltbno + ltlen > bno at line 1957 of file fs/xfs/libxfs/xfs_alloc.c.  Caller xfs_free_ag_extent+0x79c/0x1130
>  CPU: 3 PID: 33 Comm: kworker/3:0 Not tainted 6.5.0-rc7-next-20230825-00001-g7f8666926889 #214
>  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
>  Workqueue: xfs-inodegc/dm-0 xfs_inodegc_worker
>  Call Trace:
>   <TASK>
>   dump_stack_lvl+0x50/0x70
>   xfs_corruption_error+0x134/0x150
>   xfs_free_ag_extent+0x7d3/0x1130
>   __xfs_free_extent+0x201/0x3c0
>   xfs_trans_free_extent+0x29b/0xa10
>   xfs_extent_free_finish_item+0x2a/0xb0
>   xfs_defer_finish_noroll+0x8d1/0x1b40
>   xfs_defer_finish+0x21/0x200
>   xfs_itruncate_extents_flags+0x1cb/0x650
>   xfs_free_eofblocks+0x18f/0x250
>   xfs_inactive+0x485/0x570
>   xfs_inodegc_worker+0x207/0x530
>   process_scheduled_works+0x24a/0xe10
>   worker_thread+0x5ac/0xc60
>   kthread+0x2cd/0x3c0
>   ret_from_fork+0x4a/0x80
>   ret_from_fork_asm+0x11/0x20
>   </TASK>
>  XFS (dm-0): Corruption detected. Unmount and run xfs_repair
> 
> After analyzing the disk image, it was found that the corruption was
> triggered by the fact that extent was recorded in both the inode and AGF
> btrees. After a long time of reproduction and analysis, we found that the
> root cause of the problem was that the AGF btree block was not recovered.

Why was it not recovered? Because of an injected IO error during
recovery?

> Consider the following situation, Transaction A and Transaction B are in
> the same record, so Transaction A and Transaction B share the same LSN1.
> If the buf item in Transaction A has been recovered, then the buf item in
> Transaction B cannot be recovered, because log recovery skips items with a
> metadata LSN >= the current LSN of the recovery item.

This makes no sense to me. Transactions don't exist in the journal;
they are purely in-memory constructs that are aggregated
in memory (in the CIL) before being written to disk as an atomic
checkpoint. Hence a log item can only appear once in a checkpoint
regardless of how many transactions modify it in memory between
CIL checkpoints.
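
To illustrate with a toy model (invented names only, nothing like the
real CIL code): relogging the same object before a checkpoint simply
replaces its pending image, so the checkpoint carries only the final
state:

  #include <stdio.h>
  #include <string.h>

  /* Toy model only, not the kernel's CIL implementation. */
  struct cil_item {
          const char *object;     /* which metadata object this shadows */
          char image[64];         /* latest logged image of that object */
  };

  static void cil_relog(struct cil_item *item, const char *object,
                        const char *image)
  {
          /* Relogging the same object replaces its pending image. */
          if (strcmp(item->object, object) == 0)
                  snprintf(item->image, sizeof(item->image), "%s", image);
  }

  int main(void)
  {
          struct cil_item agf = { .object = "AGF", .image = "" };

          cil_relog(&agf, "AGF", "extent X free");        /* trans A */
          cil_relog(&agf, "AGF", "extent X allocated");   /* trans B */

          /* The next checkpoint contains only this single, final image. */
          printf("checkpoint image of %s: %s\n", agf.object, agf.image);
          return 0;
  }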

> If there is still an
> inode item in transaction B that records the Extent X, the Extent X will
> be recorded in both the inode and the AGF btree block after transaction B
> is recovered.

That transaction should record both the addition to the inode BMBT
and the removal from the AGF. Hence if transaction B is recovered in
full with no errors, this should not occur.

> 
>   |------------Record (LSN1)------------------|---Record (LSN2)---|
>   |----------Trans A------------|-------------Trans B-------------|
>   |     Buf Item(Extent X)      | Buf Item / Inode item(Extent X) |
>   |     Extent X is freed       |     Extent X is allocated       |

This looks wrong. A transaction can only exist in a single CIL
checkpoint and everything in a checkpoint has the same LSN. Hence we
cannot have the situation where trans B spans two different
checkpoints and hence span LSNs.

These are valid representations:

  |------------Record (LSN1)----|-----------------Record (LSN2)---|
  |----------Trans A------------|-------------Trans B-------------|

  |------------Record (LSN1)--------------------------------------|
  |----------Trans A------------|-------------Trans B-------------|

  |-----------------------------------------------Record (LSN2)---|
  |----------Trans A------------|-------------Trans B-------------|

Only in the first case are there two instances of the AGF buf item
object in the journal (one in each checkpoint). In the latter two
cases, there is only one copy of the AGF buf log item that contains
extent X. Indeed, it will not contain extent X, because the CIL
aggregation results in the addition in trans A being elided by the
removal in trans B, essentially resulting in the buffer being
unchanged except for the LSN after recovery.

As such, I'm really not sure what you are trying to describe here -
if recovery of the checkpoint at LSN1 fails in any way, we should
never attempt to recover the checkpoint at LSN2. If LSN1 recovers
entirely successfully, then LSN2 should see the correct state and
recover appropriately, too. Hence I don't see how the situation you
are describing arises.

> After commit 12818d24db8a ("xfs: rework log recovery to submit buffers on
> LSN boundaries") was introduced, we submit buffers on lsn boundaries during
> log recovery. 

Correct - we queue all the changes in a checkpoint for submission
before we start recovering the next checkpoint. That's because
checkpoints are supposed to be atomic units of change moving the
on-disk state from one change set to the next.

If any error during processing of a checkpoint occurs, we are
supposed to abort recovery at that checkpoint so we don't create a
situation where future recovery attempts skip checkpoints that need
to be recovered.

It does not matter if we write back the modified buffers from
partially completed checkpoints - they were successfully recovered
in their entirety, and so it is safe to write them back knowing that
the next attempt to recover the failed checkpoint will see a
matching LSN and skip that buffer item. If writeback fails, then it
just doesn't matter as the next recovery attempt will see the old
LSN and recover that buf item again and write it back....

AFAICT, you're describing things working as they are supposed to,
and I don't see where the problem you are attempting to fix is yet.

> The above problem can be avoided under normal paths, but it's
> not guaranteed under abnormal paths. Consider the following process, if an
> error was encountered after recover buf item in transaction A and before
> recover buf item in transaction B, buffers that have been added to
> buffer_list will still be submitted, this violates the submits rule on lsn
> boundaries. So buf item in Transaction B cannot be recovered on the next
> mount due to current lsn of transaction equal to metadata lsn on disk.
> 
>   xlog_do_recovery_pass
>     error = xlog_recover_process
>       xlog_recover_process_data
>         ...
>           xlog_recover_buf_commit_pass2
>             xlog_recover_do_reg_buffer  //recover buf item in Trans A
>             xfs_buf_delwri_queue(bp, buffer_list)
>         ...
>         ====> Encountered error and returned

What error is this, and why isn't it a fatal error causing the
checkpoint recovery to be aborted and the delwri list to be canceled?

>         ...
>           xlog_recover_buf_commit_pass2
>             xlog_recover_do_reg_buffer  //recover buf item in Trans B
>             xfs_buf_delwri_queue(bp, buffer_list)

This should never occur as we should be aborting log recovery on the
first failure, not continuing to process the checkpoint or starting
to process other checkpoints. Where are we failing to handle an
error?

>     if (!list_empty(&buffer_list))
>       xfs_buf_delwri_submit(&buffer_list); //submit regardless of error

Yes, that's fine (as per above). Indeed, this is how we handle
releasing the buffer log item on failure - this goes through IO
completion and that releases the buf log item we added to the buffer
during recovery for LSN stamping.

> In order to make sure that submits buffers on lsn boundaries in the
> abnormal paths, we need to check error status before submit buffers that
> have been added from the last record processed. If error status exist,
> buffers in the bufffer_list should be canceled.

No, it does not need to be cancelled, it just needs to be processed.
Anything we've fully recovered is safe to write - it's no different
from the system crashing during AIL writeback after having written
back these buffers, then having to recover from part way through this
checkpoint.

> Canceling the buffers in the buffer_list directly isn't correct, unlike
> any other place where write list was canceled, these buffers has been
> initialized by xfs_buf_item_init() during recovery and held by buf
> item, buf items will not be released in xfs_buf_delwri_cancel(). If
> these buffers are submitted successfully, buf items assocated with
> the buffer will be released in io end process. So releasing buf item
> in write list cacneling process is needed.

Yes, that's why we use xfs_buf_delwri_submit() even on error - it
handles releasing the buffer log item in IO completion handling
(even on error).



> Fixes: 50d5c8d8e938 ("xfs: check LSN ordering for v5 superblocks during recovery")
> Signed-off-by: Long Li <leo.lilong@huawei.com>
> ---
>  fs/xfs/xfs_buf.c         |  2 ++
>  fs/xfs/xfs_log_recover.c | 22 +++++++++++++---------
>  2 files changed, 15 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 8e5bd50d29fe..6a1b26aaf97e 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -2075,6 +2075,8 @@ xfs_buf_delwri_cancel(
>  		xfs_buf_lock(bp);
>  		bp->b_flags &= ~_XBF_DELWRI_Q;
>  		xfs_buf_list_del(bp);
> +		if (bp->b_log_item)
> +			xfs_buf_item_relse(bp);
>  		xfs_buf_relse(bp);
>  	}
>  }

I don't think this is a good idea - the delwri queue does not own the
buf log item reference, and so this will cause problems with
anything that already handles buf log item references correctly.

> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 1251c81e55f9..2cda6c90890d 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -2964,7 +2964,6 @@ xlog_do_recovery_pass(
>  	char			*offset;
>  	char			*hbp, *dbp;
>  	int			error = 0, h_size, h_len;
> -	int			error2 = 0;
>  	int			bblks, split_bblks;
>  	int			hblks, split_hblks, wrapped_hblks;
>  	int			i;
> @@ -3203,16 +3202,21 @@ xlog_do_recovery_pass(
>   bread_err1:
>  	kmem_free(hbp);
>  
> -	/*
> -	 * Submit buffers that have been added from the last record processed,
> -	 * regardless of error status.
> -	 */
> -	if (!list_empty(&buffer_list))
> -		error2 = xfs_buf_delwri_submit(&buffer_list);
> -
>  	if (error && first_bad)
>  		*first_bad = rhead_blk;
>  
> +	/*
> +	 * If there are no error, submit buffers that have been added from the
> +	 * last record processed, othrewise cancel the write list, to ensure
> +	 * submit buffers on LSN boundaries.
> +	 */
> +	if (!list_empty(&buffer_list)) {
> +		if (error)
> +			xfs_buf_delwri_cancel(&buffer_list);
> +		else
> +			error = xfs_buf_delwri_submit(&buffer_list);
> +	}

> +
>  	/*
>  	 * Transactions are freed at commit time but transactions without commit
>  	 * records on disk are never committed. Free any that may be left in the
> @@ -3226,7 +3230,7 @@ xlog_do_recovery_pass(
>  			xlog_recover_free_trans(trans);
>  	}
>  
> -	return error ? error : error2;
> +	return error;
>  }
>  
>  /*
> -- 
> 2.31.1
> 
> 
>
Long Li Jan. 8, 2024, 12:28 p.m. UTC | #4
Hi, Dave

Thanks for your reply.

On Mon, Jan 08, 2024 at 09:13:50AM +1100, Dave Chinner wrote:
> On Thu, Dec 28, 2023 at 08:46:46PM +0800, Long Li wrote:
> > While performing the IO fault injection test, I caught the following data
> > corruption report:
> > 
> >  XFS (dm-0): Internal error ltbno + ltlen > bno at line 1957 of file fs/xfs/libxfs/xfs_alloc.c.  Caller xfs_free_ag_extent+0x79c/0x1130
> >  CPU: 3 PID: 33 Comm: kworker/3:0 Not tainted 6.5.0-rc7-next-20230825-00001-g7f8666926889 #214
> >  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
> >  Workqueue: xfs-inodegc/dm-0 xfs_inodegc_worker
> >  Call Trace:
> >   <TASK>
> >   dump_stack_lvl+0x50/0x70
> >   xfs_corruption_error+0x134/0x150
> >   xfs_free_ag_extent+0x7d3/0x1130
> >   __xfs_free_extent+0x201/0x3c0
> >   xfs_trans_free_extent+0x29b/0xa10
> >   xfs_extent_free_finish_item+0x2a/0xb0
> >   xfs_defer_finish_noroll+0x8d1/0x1b40
> >   xfs_defer_finish+0x21/0x200
> >   xfs_itruncate_extents_flags+0x1cb/0x650
> >   xfs_free_eofblocks+0x18f/0x250
> >   xfs_inactive+0x485/0x570
> >   xfs_inodegc_worker+0x207/0x530
> >   process_scheduled_works+0x24a/0xe10
> >   worker_thread+0x5ac/0xc60
> >   kthread+0x2cd/0x3c0
> >   ret_from_fork+0x4a/0x80
> >   ret_from_fork_asm+0x11/0x20
> >   </TASK>
> >  XFS (dm-0): Corruption detected. Unmount and run xfs_repair
> > 
> > After analyzing the disk image, it was found that the corruption was
> > triggered by the fact that extent was recorded in both the inode and AGF
> > btrees. After a long time of reproduction and analysis, we found that the
> > root cause of the problem was that the AGF btree block was not recovered.
> 
> Why was it not recovered? Because of an injected IO error during
> recovery?

The reason the buf item for the AGF btree block is not recovered is that
the LSN of the AGF btree block equals the current LSN of the recovery
item, and log recovery skips items with a metadata LSN >= the current
LSN of the recovery item.

The injected IO error during recovery is what left the LSN of the AGF
btree block equal to the current LSN of the recovery item; that happens
when two on-disk transactions share the same LSN and both modify the
same buffer. Detailed information can be found below.

> 
> > Consider the following situation, Transaction A and Transaction B are in
> > the same record, so Transaction A and Transaction B share the same LSN1.
> > If the buf item in Transaction A has been recovered, then the buf item in
> > Transaction B cannot be recovered, because log recovery skips items with a
> > metadata LSN >= the current LSN of the recovery item.
> 
> This makes no sense to me. Transactions don't exist in the journal;
> they are purely in-memory constructs that are aggregated
> in memory (in the CIL) before being written to disk as an atomic
> checkpoint. Hence a log item can only appear once in a checkpoint
> regardless of how many transactions it is modified in memory between
> CIL checkpoints.
> 
> > If there is still an
> > inode item in transaction B that records the Extent X, the Extent X will
> > be recorded in both the inode and the AGF btree block after transaction B
> > is recovered.
> 
> That transaction should record both the addition to the inode BMBT
> and the removal from the AGF. Hence if transaction B is recovered in
> full with no errors, this should not occur.
> 
> > 
> >   |------------Record (LSN1)------------------|---Record (LSN2)---|
> >   |----------Trans A------------|-------------Trans B-------------|
> >   |     Buf Item(Extent X)      | Buf Item / Inode item(Extent X) |
> >   |     Extent X is freed       |     Extent X is allocated       |
> 
> This looks wrong. A transaction can only exist in a single CIL
> checkpoint and everything in a checkpoint has the same LSN. Hence we
> cannot have the situation where trans B spans two different
> checkpoints and hence span LSNs.

There is some misunderstanding here. The transactions I was referring to are
on disk, not in memory. Each on-disk transaction corresponds to a checkpoint
(that is my understanding; we could also call it a checkpoint transaction, as
in "XFS Algorithms & Data Structures"). The two are easily confused, and their
meanings are not the same.

An on-disk transaction can span two different records. The following logs
show the details:

//Trans A, tid d0bfef23
//Trans B, tid 9a76bd30 

============================================================================
cycle: 271	version: 2		lsn: 271,14642	tail_lsn: 271,12644
length of Log Record: 32256	prev offset: 14608		num ops: 249
uuid: 01ce1afc-cedd-4120-8d8d-05fbee260af9   format: little endian linux
h_size: 32768
----------------------------------------------------------------------------
Oper (0): tid: d0bfef23  len: 0  clientid: TRANS  flags: START 
----------------------------------------------------------------------------
Oper (1): tid: d0bfef23  len: 16  clientid: TRANS  flags: none
TRAN:     tid: d0bfef23  num_items: 145
----------------------------------------------------------------------------
	  ......
----------------------------------------------------------------------------
Oper (5): tid: d0bfef23  len: 56  clientid: TRANS  flags: none
INODE: #regs: 3   ino: 0xc4aa3  flags: 0x41   dsize: 0
        blkno: 805536  len: 32  boff: 1536
Oper (6): tid: d0bfef23  len: 176  clientid: TRANS  flags: none
INODE CORE
magic 0x494e mode 0100666 version 3 format 2
nlink 1 uid 0 gid 0
atime 0xd014ae00 mtime 0xfe187700 ctime 0xfe187700
size 0x0 nblocks 0x0 extsize 0x0 nextents 0x0

/* inode 0xc4aa3 nblocks is 0 */

naextents 0x0 forkoff 35 dmevmask 0x0 dmstate 0x0
flags 0x0 gen 0x526bda01
flags2 0x8 cowextsize 0x0
Oper (7): tid: d0bfef23  len: 52  clientid: TRANS  flags: none
LOCAL attr data
----------------------------------------------------------------------------
	......
----------------------------------------------------------------------------
Oper (102): tid: d0bfef23  len: 24  clientid: TRANS  flags: none
BUF:  #regs: 2   start blkno: 1048577 (0x100001)  len: 1  bmap size: 1  flags: 0x2800
Oper (103): tid: d0bfef23  len: 128  clientid: TRANS  flags: none
AGF Buffer: XAGF  
ver: 1  seq#: 2  len: 65536  
root BNO: 3  CNT: 4
level BNO: 1  CNT: 1
1st: 110  last: 113  cnt: 4  freeblks: 40923  longest: 37466
----------------------------------------------------------------------------
Oper (104): tid: d0bfef23  len: 24  clientid: TRANS  flags: none
BUF:  #regs: 2   start blkno: 1048600 (0x100018)  len: 8  bmap size: 1  flags: 0x2000
Oper (105): tid: d0bfef23  len: 768  clientid: TRANS  flags: none
BUF DATA
 0 42334241 4a000000 ffffffff ffffffff        0 18001000  f010000 ff300000 
 8 fc1ace01 2041ddce fb058d8d f90a26ee  2000000 128ec3ad  50a0000 5d000000 
10 770a0000 21000000 a00a0000 b5000000 580b0000 2d010000 fa0c0000 10000000 
	
/* extent (770a0000 21000000) recorded in the AGF */

18 170d0000  b000000 3e0d0000 1a000000 710d0000  5000000 a60d0000  3000000 
20 ee0d0000 91000000 bf0e0000  1000000 c80e0000  3000000 d60e0000 62010000 
28 47100000  a000000 68100000  6000000 89500000 a0000000 94510000  4000000 
30 c2510000 60000000 3e520000 31000000 8a520000 1e000000 b0520000  1000000 
38 ce520000 1e000000 ee520000 11000000  8530000 18000000 14540000 23000000 
40 70540000 2c000000 a3540000 19000000 d2540000 13000000  9550000 19000000 
48 35550000  2000000 61550000  e000000 81550000 11000000 a0550000  4000000 
50 b6550000  6000000 d4550000  5000000 df550000  3000000  d560000  8000000 
58 24560000  5000000 7b560000  2000000 c8560000 10000000 df560000 36000000 
60 74570000 20000000 a2570000 5d000000 16580000 13000000 46580000 3a000000 
68 91580000  9000000 3a590000  1000000 6a590000  2000000   5a0000 15000000 
70 525a0000 35000000 e85a0000  1000000 d95b0000  2000000 615d0000 28000000 
78 485e0000 d3000000 2a5f0000  9000000 495f0000 44000000 c15f0000 a9000000 
80 cd600000  c000000 34610000  4000000 99610000  1000000 b9610000 1c000000 
88 dc610000  2000000 e2610000 84000000 77620000 1f000000 95630000 8c000000 
90 77640000 13000000 73650000 71000000 7a690000  5000000 8e690000  6000000 
98 576a0000 26000000 376b0000 3a000000 ce6b0000 1e000000  66c0000 a7000000 
a0 a66d0000 5a920000 336d0000 cd920000 336d0000 cd920000 336d0000 cd920000 
a8        0        0        0        0        0        0        0        0 
b0        0        0        0        0        0        0        0        0 
b8        0        0        0        0        0        0        0        0 

----------------------------------------------------------------------------
Oper (106): tid: d0bfef23  len: 24  clientid: TRANS  flags: none
BUF:  #regs: 2   start blkno: 1048608 (0x100020)  len: 8  bmap size: 1  flags: 0x2000
Oper (107): tid: d0bfef23  len: 768  clientid: TRANS  flags: none
BUF DATA
 0 43334241 4a000000 ffffffff ffffffff        0 20001000  f010000 ff300000 
 8 fc1ace01 2041ddce fb058d8d f90a26ee  2000000 3ea37604 bf0e0000  1000000 
10 b0520000  1000000 3a590000  1000000 e85a0000  1000000 99610000  1000000 
18 35550000  2000000 7b560000  2000000 6a590000  2000000 d95b0000  2000000 
20 dc610000  2000000 a60d0000  3000000 c80e0000  3000000 df550000  3000000 
28 94510000  4000000 a0550000  4000000 34610000  4000000 710d0000  5000000 
30 d4550000  5000000 24560000  5000000 7a690000  5000000 68100000  6000000 
38 b6550000  6000000 8e690000  6000000  d560000  8000000 91580000  9000000 
40 2a5f0000  9000000 47100000  a000000 170d0000  b000000 cd600000  c000000 
48 61550000  e000000 fa0c0000 10000000 c8560000 10000000 ee520000 11000000 
50 81550000 11000000 d2540000 13000000 16580000 13000000 77640000 13000000 
58   5a0000 15000000  8530000 18000000 a3540000 19000000  9550000 19000000 
60 3e0d0000 1a000000 b9610000 1c000000 8a520000 1e000000 ce520000 1e000000 
68 ce6b0000 1e000000 77620000 1f000000 74570000 20000000 770a0000 21000000 
70 14540000 23000000 576a0000 26000000 615d0000 28000000 70540000 2c000000 
78 3e520000 31000000 525a0000 35000000 df560000 36000000 46580000 3a000000 
80 376b0000 3a000000 495f0000 44000000  50a0000 5d000000 a2570000 5d000000 
88 c2510000 60000000 73650000 71000000 e2610000 84000000 95630000 8c000000 
90 ee0d0000 91000000 89500000 a0000000  66c0000 a7000000 c15f0000 a9000000 
98 a00a0000 b5000000 485e0000 d3000000 580b0000 2d010000 d60e0000 62010000 
a0 a66d0000 5a920000 336d0000 cd920000 336d0000 cd920000 336d0000 cd920000 
a8        0        0        0        0        0        0        0        0 
b0        0        0        0        0        0        0        0        0 
b8        0        0        0        0        0        0        0        0 

----------------------------------------------------------------------------
	......
----------------------------------------------------------------------------
Oper (147): tid: d0bfef23  len: 0  clientid: TRANS  flags: COMMIT 
----------------------------------------------------------------------------
Oper (148): tid: 9a76bd30  len: 0  clientid: TRANS  flags: START 
----------------------------------------------------------------------------
Oper (149): tid: 9a76bd30  len: 16  clientid: TRANS  flags: none
TRAN:     tid: 9a76bd30  num_items: 164
----------------------------------------------------------------------------
	......
----------------------------------------------------------------------------
Oper (248): tid: 9a76bd30  len: 168  clientid: TRANS  flags: CONTINUE 

============================================================================
cycle: 271	version: 2		lsn: 271,14706	tail_lsn: 271,12644
length of Log Record: 10752	prev offset: 14642		num ops: 67
uuid: 01ce1afc-cedd-4120-8d8d-05fbee260af9   format: little endian linux
h_size: 32768
----------------------------------------------------------------------------
Oper (0): tid: 9a76bd30  len: 8  clientid: TRANS  flags: WAS_CONT END 
Left over region from split log item
----------------------------------------------------------------------------
Oper (1): tid: 9a76bd30  len: 52  clientid: TRANS  flags: none
Left over region from split log item
----------------------------------------------------------------------------
	......
----------------------------------------------------------------------------
Oper (5): tid: 9a76bd30  len: 56  clientid: TRANS  flags: none
INODE: #regs: 4   ino: 0xc4aa3  flags: 0x45   dsize: 32
        blkno: 805536  len: 32  boff: 1536
Oper (6): tid: 9a76bd30  len: 176  clientid: TRANS  flags: none
INODE CORE
magic 0x494e mode 0100666 version 3 format 2
nlink 0 uid 0 gid 0
atime 0xf4e300 mtime 0x2261000 ctime 0x2a02200
size 0x1a9600 nblocks 0x3f extsize 0x0 nextents 0x2

/* inode 0xc4aa3 nblocks is 0x3f */

naextents 0x0 forkoff 24 dmevmask 0x0 dmstate 0x0
flags 0x0 gen 0x526bda01
flags2 0x8 cowextsize 0x0
Oper (7): tid: 9a76bd30  len: 32  clientid: TRANS  flags: none
EXTENTS inode data
Oper (8): tid: 9a76bd30  len: 144  clientid: TRANS  flags: none
LOCAL attr data
----------------------------------------------------------------------------
	......
----------------------------------------------------------------------------
Oper (57): tid: 9a76bd30  len: 24  clientid: TRANS  flags: none
BUF:  #regs: 2   start blkno: 1048577 (0x100001)  len: 1  bmap size: 1  flags: 0x2800
Oper (58): tid: 9a76bd30  len: 128  clientid: TRANS  flags: none
AGF Buffer: XAGF  
ver: 1  seq#: 2  len: 65536  
root BNO: 3  CNT: 4
level BNO: 1  CNT: 1
1st: 114  last: 117  cnt: 4  freeblks: 40714  longest: 37466
----------------------------------------------------------------------------
Oper (59): tid: 9a76bd30  len: 24  clientid: TRANS  flags: none
BUF:  #regs: 2   start blkno: 1048608 (0x100020)  len: 8  bmap size: 1  flags: 0x2000
Oper (60): tid: 9a76bd30  len: 768  clientid: TRANS  flags: none
BUF DATA
 0 43334241 47000000 ffffffff ffffffff        0 20001000  f010000 ff300000 
 8 fc1ace01 2041ddce fb058d8d f90a26ee  2000000 3ea37604 b0520000  1000000 
10 3a590000  1000000 e85a0000  1000000 99610000  1000000 dc610000  1000000 
18 35550000  2000000 7b560000  2000000 6a590000  2000000 d95b0000  2000000 
20 a60d0000  3000000 c80e0000  3000000 df550000  3000000 34610000  4000000 
28 710d0000  5000000 d4550000  5000000 24560000  5000000 7a690000  5000000 
30 68100000  6000000 b6550000  6000000 8e690000  6000000  d560000  8000000 
38 91580000  9000000 2a5f0000  9000000 5d620000  9000000 47100000  a000000 
40 170d0000  b000000 cd600000  c000000 580b0000  d000000 61550000  e000000 
48 fa0c0000 10000000 d5540000 10000000 c8560000 10000000 ee520000 11000000 
50 81550000 11000000 16580000 13000000 77640000 13000000   5a0000 15000000 
58  8530000 18000000 a3540000 19000000  9550000 19000000 3e0d0000 1a000000 
60 b9610000 1c000000 8a520000 1e000000 ce520000 1e000000 ce6b0000 1e000000 
68 77620000 1f000000 74570000 20000000 14540000 23000000 576a0000 26000000 
70 615d0000 28000000 70540000 2c000000 3e520000 31000000 525a0000 35000000 
78 df560000 36000000 46580000 3a000000 376b0000 3a000000 495f0000 44000000 
80 c6510000 5c000000  50a0000 5d000000 a2570000 5d000000 73650000 71000000 
88 95630000 8c000000 ee0d0000 91000000 89500000 a0000000  66c0000 a7000000 
90 c15f0000 a9000000 a00a0000 b5000000 485e0000 d3000000 830b0000  2010000 
98 dc0e0000 5c010000 a66d0000 5a920000 a66d0000 5a920000 a66d0000 5a920000 
a0 a66d0000 5a920000 336d0000 cd920000 336d0000 cd920000 336d0000 cd920000 
a8        0        0        0        0        0        0        0        0 
b0        0        0        0        0        0        0        0        0 
b8        0        0        0        0        0        0        0        0 

----------------------------------------------------------------------------
Oper (61): tid: 9a76bd30  len: 24  clientid: TRANS  flags: none
BUF:  #regs: 2   start blkno: 1048600 (0x100018)  len: 8  bmap size: 1  flags: 0x2000
Oper (62): tid: 9a76bd30  len: 768  clientid: TRANS  flags: none
BUF DATA
 0 42334241 47000000 ffffffff ffffffff        0 18001000  f010000 ff300000 
 8 fc1ace01 2041ddce fb058d8d f90a26ee  2000000 128ec3ad  50a0000 5d000000 
10 a00a0000 b5000000 580b0000  d000000 830b0000  2010000 fa0c0000 10000000 
	
/* extent (770a0000 21000000) has been removed from the AGF */

18 170d0000  b000000 3e0d0000 1a000000 710d0000  5000000 a60d0000  3000000 
20 ee0d0000 91000000 c80e0000  3000000 dc0e0000 5c010000 47100000  a000000 
28 68100000  6000000 89500000 a0000000 c6510000 5c000000 3e520000 31000000 
30 8a520000 1e000000 b0520000  1000000 ce520000 1e000000 ee520000 11000000 
38  8530000 18000000 14540000 23000000 70540000 2c000000 a3540000 19000000 
40 d5540000 10000000  9550000 19000000 35550000  2000000 61550000  e000000 
48 81550000 11000000 b6550000  6000000 d4550000  5000000 df550000  3000000 
50  d560000  8000000 24560000  5000000 7b560000  2000000 c8560000 10000000 
58 df560000 36000000 74570000 20000000 a2570000 5d000000 16580000 13000000 
60 46580000 3a000000 91580000  9000000 3a590000  1000000 6a590000  2000000 
68   5a0000 15000000 525a0000 35000000 e85a0000  1000000 d95b0000  2000000 
70 615d0000 28000000 485e0000 d3000000 2a5f0000  9000000 495f0000 44000000 
78 c15f0000 a9000000 cd600000  c000000 34610000  4000000 99610000  1000000 
80 b9610000 1c000000 dc610000  1000000 5d620000  9000000 77620000 1f000000 
88 95630000 8c000000 77640000 13000000 73650000 71000000 7a690000  5000000 
90 8e690000  6000000 576a0000 26000000 376b0000 3a000000 ce6b0000 1e000000 
98  66c0000 a7000000 a66d0000 5a920000 a66d0000 5a920000 a66d0000 5a920000 
a0 a66d0000 5a920000 336d0000 cd920000 336d0000 cd920000 336d0000 cd920000 
a8        0        0        0        0        0        0        0        0 
b0        0        0        0        0        0        0        0        0 
b8        0        0        0        0        0        0        0        0 

----------------------------------------------------------------------------
	......
----------------------------------------------------------------------------
Oper (66): tid: 9a76bd30  len: 0  clientid: TRANS  flags: COMMIT 


> 
> These are valid representations:
> 
>   |------------Record (LSN1)----|-----------------Record (LSN2)---|
>   |----------Trans A------------|-------------Trans B-------------|
> 
>   |------------Record (LSN1)--------------------------------------|
>   |----------Trans A------------|-------------Trans B-------------|
> 
>   |-----------------------------------------------Record (LSN2)---|
>   |----------Trans A------------|-------------Trans B-------------|
> 
> Only in the first case are there two instances of the AGF buf item
> object in the journal (one in each checkpoint). In the latter two
> cases, there is only one copy of the AGF buf log item that contains
> extent X. Indeed, it will not contain extent X, because the CIL
> aggregation results in the addition in trans A being elided by the
> removal in trans B, essentially resulting in the buffer being
> unchanged except for the LSN after recovery.
> 
> As such, I'm really not sure what you are trying to describe here -
> if recovery of the checkpoint at LSN1 fails in any way, we should
> never attempt to recovery the checkpoint at LSN2. If LSN1 recoveres
> entirely successfully, then LSN2 should see the correct state and
> recover appropriately, too. Hence I don't see how the situation you
> are describing arises.
> 
> > After commit 12818d24db8a ("xfs: rework log recovery to submit buffers on
> > LSN boundaries") was introduced, we submit buffers on lsn boundaries during
> > log recovery. 
> 
> Correct - we submit all the changes in a checkpoint for submission
> before we start recovering the next checkpoint. That's because
> checkpoints are supposed to be atomic units of change moving the
> on-disk state from one change set to the next.

Submitting buffers on LSN boundaries does not mean submitting buffers on
checkpoint boundaries during recovery. In my understanding, one on-disk
transaction corresponds to a checkpoint, but multiple on-disk transactions
may share the same LSN, so in that case we have to make sure the buffers
for all of those transactions are submitted together. This rule was
introduced by commit 12818d24db8a ("xfs: rework log recovery to submit
buffers on LSN boundaries").

We honour that rule in most cases now, but we violate it when submitting
buffers regardless of error in xlog_do_recovery_pass(); this patch tries
to fix that problem.

If I've misunderstood, please point it out. :)

Thanks,
Long Li

> 
> If any error during processing of a checkpoint occurs, we are
> supposed to abort recovery at that checkpoint so we don't create a
> situation where future recovery attempts skip checkpoints that need
> to be recovered.
> 
> It does not matter if we write back the modified buffers from
> partially completed checkpoints - they were successfully recovered
> in their entirity, and so it is safe to write them back knowing that
> the next attempt to recover the failed checkpoint will see a
> matching LSN and skip that buffer item. If writeback fails, then it
> just doesn't matter as the next recovery attempt will see the old
> LSN and recover that buf item again and write it back....
> 
> AFAICT, you're describing things working as they are supposed to,
> and I don't see where the problem you are attempting to fix is yet.
> 
> > The above problem can be avoided under normal paths, but it's
> > not guaranteed under abnormal paths. Consider the following process, if an
> > error was encountered after recover buf item in transaction A and before
> > recover buf item in transaction B, buffers that have been added to
> > buffer_list will still be submitted, this violates the submits rule on lsn
> > boundaries. So buf item in Transaction B cannot be recovered on the next
> > mount due to current lsn of transaction equal to metadata lsn on disk.
> > 
> >   xlog_do_recovery_pass
> >     error = xlog_recover_process
> >       xlog_recover_process_data
> >         ...
> >           xlog_recover_buf_commit_pass2
> >             xlog_recover_do_reg_buffer  //recover buf item in Trans A
> >             xfs_buf_delwri_queue(bp, buffer_list)
> >         ...
> >         ====> Encountered error and returned
> 
> What error is this, and why isn't it a fatal error causing the
> checkpoint recovery to be aborted and the delwri list to be canceled?
> 
> >         ...
> >           xlog_recover_buf_commit_pass2
> >             xlog_recover_do_reg_buffer  //recover buf item in Trans B
> >             xfs_buf_delwri_queue(bp, buffer_list)
> 
> This should never occur as we should be aborting log recovery on the
> first failure, not continuing to process the checkpoint or starting
> to process other checkpoints. Where are we failing to handle an
> error?
> 
> >     if (!list_empty(&buffer_list))
> >       xfs_buf_delwri_submit(&buffer_list); //submit regardless of error
> 
> Yes, that's fine (as per above). Indeed, this is how we handle
> releasing the buffer log item on failure - this goes through IO
> completion and that release the buf log item we added to the buffer
> during recovery for LSN stamping.
> 
> > In order to make sure that submits buffers on lsn boundaries in the
> > abnormal paths, we need to check error status before submit buffers that
> > have been added from the last record processed. If error status exist,
> > buffers in the bufffer_list should be canceled.
> 
> No, it does not need to be cancelled, it just needs to be processed.
> Anything we've fully recovered is safe to write - it's no different
> from having the system crash during AIL writeback having written
> back these buffers and having to recover from part way through this
> checkpoint.
> 
> > Canceling the buffers in the buffer_list directly isn't correct, unlike
> > any other place where write list was canceled, these buffers has been
> > initialized by xfs_buf_item_init() during recovery and held by buf
> > item, buf items will not be released in xfs_buf_delwri_cancel(). If
> > these buffers are submitted successfully, buf items assocated with
> > the buffer will be released in io end process. So releasing buf item
> > in write list cacneling process is needed.
> 
> Yes, that's why we use xfs_buf_delwri_submit() even on error - it
> handles releasing the buffer log item in IO completion handling
> (even on error).
> 
> 
> 
> > Fixes: 50d5c8d8e938 ("xfs: check LSN ordering for v5 superblocks during recovery")
> > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > ---
> >  fs/xfs/xfs_buf.c         |  2 ++
> >  fs/xfs/xfs_log_recover.c | 22 +++++++++++++---------
> >  2 files changed, 15 insertions(+), 9 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index 8e5bd50d29fe..6a1b26aaf97e 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -2075,6 +2075,8 @@ xfs_buf_delwri_cancel(
> >  		xfs_buf_lock(bp);
> >  		bp->b_flags &= ~_XBF_DELWRI_Q;
> >  		xfs_buf_list_del(bp);
> > +		if (bp->b_log_item)
> > +			xfs_buf_item_relse(bp);
> >  		xfs_buf_relse(bp);
> >  	}
> >  }
> 
> I don't think this is a good idea - the delwri does not own the
> buf log item reference, and so this will cause problems with
> anything that already handles buf log item references correctly.
> 
> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index 1251c81e55f9..2cda6c90890d 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -2964,7 +2964,6 @@ xlog_do_recovery_pass(
> >  	char			*offset;
> >  	char			*hbp, *dbp;
> >  	int			error = 0, h_size, h_len;
> > -	int			error2 = 0;
> >  	int			bblks, split_bblks;
> >  	int			hblks, split_hblks, wrapped_hblks;
> >  	int			i;
> > @@ -3203,16 +3202,21 @@ xlog_do_recovery_pass(
> >   bread_err1:
> >  	kmem_free(hbp);
> >  
> > -	/*
> > -	 * Submit buffers that have been added from the last record processed,
> > -	 * regardless of error status.
> > -	 */
> > -	if (!list_empty(&buffer_list))
> > -		error2 = xfs_buf_delwri_submit(&buffer_list);
> > -
> >  	if (error && first_bad)
> >  		*first_bad = rhead_blk;
> >  
> > +	/*
> > +	 * If there are no error, submit buffers that have been added from the
> > +	 * last record processed, othrewise cancel the write list, to ensure
> > +	 * submit buffers on LSN boundaries.
> > +	 */
> > +	if (!list_empty(&buffer_list)) {
> > +		if (error)
> > +			xfs_buf_delwri_cancel(&buffer_list);
> > +		else
> > +			error = xfs_buf_delwri_submit(&buffer_list);
> > +	}
> 
> > +
> >  	/*
> >  	 * Transactions are freed at commit time but transactions without commit
> >  	 * records on disk are never committed. Free any that may be left in the
> > @@ -3226,7 +3230,7 @@ xlog_do_recovery_pass(
> >  			xlog_recover_free_trans(trans);
> >  	}
> >  
> > -	return error ? error : error2;
> > +	return error;
> >  }
> >  
> >  /*
> > -- 
> > 2.31.1
> > 
> > 
> > 
> 
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Jan. 8, 2024, 11:40 p.m. UTC | #5
On Mon, Jan 08, 2024 at 08:28:19PM +0800, Long Li wrote:
> Hi, Dave
> 
> Thanks for your reply.
> 
> On Mon, Jan 08, 2024 at 09:13:50AM +1100, Dave Chinner wrote:
> > On Thu, Dec 28, 2023 at 08:46:46PM +0800, Long Li wrote:
> > > While performing the IO fault injection test, I caught the following data
> > > corruption report:
> > > 
> > >  XFS (dm-0): Internal error ltbno + ltlen > bno at line 1957 of file fs/xfs/libxfs/xfs_alloc.c.  Caller xfs_free_ag_extent+0x79c/0x1130
> > >  CPU: 3 PID: 33 Comm: kworker/3:0 Not tainted 6.5.0-rc7-next-20230825-00001-g7f8666926889 #214
> > >  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
> > >  Workqueue: xfs-inodegc/dm-0 xfs_inodegc_worker
> > >  Call Trace:
> > >   <TASK>
> > >   dump_stack_lvl+0x50/0x70
> > >   xfs_corruption_error+0x134/0x150
> > >   xfs_free_ag_extent+0x7d3/0x1130
> > >   __xfs_free_extent+0x201/0x3c0
> > >   xfs_trans_free_extent+0x29b/0xa10
> > >   xfs_extent_free_finish_item+0x2a/0xb0
> > >   xfs_defer_finish_noroll+0x8d1/0x1b40
> > >   xfs_defer_finish+0x21/0x200
> > >   xfs_itruncate_extents_flags+0x1cb/0x650
> > >   xfs_free_eofblocks+0x18f/0x250
> > >   xfs_inactive+0x485/0x570
> > >   xfs_inodegc_worker+0x207/0x530
> > >   process_scheduled_works+0x24a/0xe10
> > >   worker_thread+0x5ac/0xc60
> > >   kthread+0x2cd/0x3c0
> > >   ret_from_fork+0x4a/0x80
> > >   ret_from_fork_asm+0x11/0x20
> > >   </TASK>
> > >  XFS (dm-0): Corruption detected. Unmount and run xfs_repair
> > > 
> > > After analyzing the disk image, it was found that the corruption was
> > > triggered by the fact that extent was recorded in both the inode and AGF
> > > btrees. After a long time of reproduction and analysis, we found that the
> > > root cause of the problem was that the AGF btree block was not recovered.
> > 
> > Why was it not recovered? Because of an injected IO error during
> > recovery?
> 
> The reason why the buf item of the AGF btree is not recovered is that the
> LSN of the AGF btree block is equal to the current LSN of the recovery
> item, because log recovery skips items with a metadata LSN >= the current
> LSN of the recovery item.

Which is the defined behaviour of the LSN sequencing fields. This is
not an issue - the bug lies elsewhere.

> The injected IO error during recovery causes the LSN of the AGF btree
> block to be equal to the current LSN of the recovery item,

Where is this IO error being injected? You haven't actually
explained where in the recovery process the error occurs that
exposes the issue.

> which happens in the
> situation where two transactions on disk share the same LSN and both
> modify the same buffer. Detailed information can be found below.

That is a normal situation - we can pack multiple async checkpoints
to a single iclog, so I really  don't see how injecting IO errors is
in any way related to creating this problem....

> > > Consider the following situation, Transaction A and Transaction B are in
> > > the same record, so Transaction A and Transaction B share the same LSN1.
> > > If the buf item in Transaction A has been recovered, then the buf item in
> > > Transaction B cannot be recovered, because log recovery skips items with a
> > > metadata LSN >= the current LSN of the recovery item.
> > 
> > This makes no sense to me. Transactions don't exist in the journal;
> > they are purely in-memory constructs that are aggregated
> > in memory (in the CIL) before being written to disk as an atomic
> > checkpoint. Hence a log item can only appear once in a checkpoint
> > regardless of how many transactions it is modified in memory between
> > CIL checkpoints.
> > 
> > > If there is still an
> > > inode item in transaction B that records the Extent X, the Extent X will
> > > be recorded in both the inode and the AGF btree block after transaction B
> > > is recovered.
> > 
> > That transaction should record both the addition to the inode BMBT
> > and the removal from the AGF. Hence if transaction B is recovered in
> > full with no errors, this should not occur.
> > 
> > > 
> > >   |------------Record (LSN1)------------------|---Record (LSN2)---|
> > >   |----------Trans A------------|-------------Trans B-------------|
> > >   |     Buf Item(Extent X)      | Buf Item / Inode item(Extent X) |
> > >   |     Extent X is freed       |     Extent X is allocated       |
> > 
> > This looks wrong. A transaction can only exist in a single CIL
> > checkpoint and everything in a checkpoint has the same LSN. Hence we
> > cannot have the situation where trans B spans two different
> > checkpoints and hence span LSNs.
> 
> There is some misunderstanding here. The transactions I am talking about
> are on disk, not in memory. Each transaction on disk corresponds to a
> checkpoint (this is my understanding, or we can call it a checkpoint
> transaction, as <<XFS Algorithms & Data Structures>> does). The two are
> easily confused, and their meanings are not the same.

I have no idea what "<<XFS Algorithms & Data Structures>>" is. Can
you please provide a link?

As it is, the differences between transactions in XFS and journal
checkpointing is fully described in
Documentation/filesystems/xfs-delayed-logging-design.rst.

Just because we used the on disk journal format structure called a
"transaction record" to implement checkpoints for delayed logging
does not mean that the information on disk is a transaction. This
was an implementation convenience that avoided the need to change
the on-disk format for delayed logging. The information held in the
journal is a checkpoint....

> A transaction on disk can span two different records. The following logs
> show the details:
> 
> //Trans A, tid d0bfef23
> //Trans B, tid 9a76bd30 
>
> ============================================================================
> cycle: 271	version: 2		lsn: 271,14642	tail_lsn: 271,12644
> length of Log Record: 32256	prev offset: 14608		num ops: 249
> uuid: 01ce1afc-cedd-4120-8d8d-05fbee260af9   format: little endian linux
> h_size: 32768
> ----------------------------------------------------------------------------
> Oper (0): tid: d0bfef23  len: 0  clientid: TRANS  flags: START 
> ----------------------------------------------------------------------------
> Oper (1): tid: d0bfef23  len: 16  clientid: TRANS  flags: none
> TRAN:     tid: d0bfef23  num_items: 145
> ----------------------------------------------------------------------------

Yup, that is the start of checkpoint A, at LSN 271,14642.

> 	......
> ----------------------------------------------------------------------------
> Oper (102): tid: d0bfef23  len: 24  clientid: TRANS  flags: none
> BUF:  #regs: 2   start blkno: 1048577 (0x100001)  len: 1  bmap size: 1  flags: 0x2800
> Oper (103): tid: d0bfef23  len: 128  clientid: TRANS  flags: none
> AGF Buffer: XAGF  
> ver: 1  seq#: 2  len: 65536  
> root BNO: 3  CNT: 4
> level BNO: 1  CNT: 1
> 1st: 110  last: 113  cnt: 4  freeblks: 40923  longest: 37466
> ----------------------------------------------------------------------------
> Oper (104): tid: d0bfef23  len: 24  clientid: TRANS  flags: none
> BUF:  #regs: 2   start blkno: 1048600 (0x100018)  len: 8  bmap size: 1  flags: 0x2000
> Oper (105): tid: d0bfef23  len: 768  clientid: TRANS  flags: none
> BUF DATA
>  0 42334241 4a000000 ffffffff ffffffff        0 18001000  f010000 ff300000 
     ^^^^^^^^^						    ^^^^^^^^^^^^^^^^
     magic						    lsn

#define XFS_ABTB_CRC_MAGIC      0x41423342      /* 'AB3B' */

and the LSN is 0x10f000030ff or (271,12543)

>  8 fc1ace01 2041ddce fb058d8d f90a26ee  2000000 128ec3ad  50a0000 5d000000 
> 10 770a0000 21000000 a00a0000 b5000000 580b0000 2d010000 fa0c0000 10000000 
> 	
> /* extent (770a0000 21000000) recorded in the AGF */

It's been recorded in the by-blocknumber freespace btree block, not
the AGF.

> 	......
> ----------------------------------------------------------------------------
> Oper (147): tid: d0bfef23  len: 0  clientid: TRANS  flags: COMMIT 
> ----------------------------------------------------------------------------

And there is the end of checkpoint A. It gets recovered when this
COMMIT ophdr is processed and the buffer list gets submitted because
log->l_recovery_lsn != trans->r_lsn.

The problem then arises here:


> Oper (148): tid: 9a76bd30  len: 0  clientid: TRANS  flags: START 
> ----------------------------------------------------------------------------
> Oper (149): tid: 9a76bd30  len: 16  clientid: TRANS  flags: none
> TRAN:     tid: 9a76bd30  num_items: 164
> ----------------------------------------------------------------------------

At the start of checkpoint B. The START ophdr was formatted into the
same iclog as the start/finish of checkpoint A, which means we have
two checkpoints being recovered with the same rhead->h_lsn value.
But we've just submitted the buffer list for checkpoint A, so all
the buffers modified in that checkpoint are going to be stamped with
trans->r_lsn.

And because checkpoint B has the same trans->r_lsn value as
checkpoint A (because they both start in the same record), then
recovery of anything modified in both checkpoint A and checkpoint B
is going to be skipped.

This is not confined to buffers - these are just the messenger. The
same will happen with inodes, dquots and any other metadata object
that is modified in both transactions: the second modification will
be skipped because the LSN in the metadata object matches the
trans->r_lsn value of the second checkpoint.

This means the problem is not isolated to the delayed write buffer
list - any inode we recover immediately writes the trans->r_lsn into
the inode and then recalculates the CRC. Sure, the buffer is then
added to the delwri list, but the inode LSN in the buffer has
already been updated before the buffer is placed on the delwri list.
Same goes for dquot recovery.
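
To make that skip concrete, here is a small stand-alone user-space
simulation (not kernel code - the struct and function names are invented
for illustration) of the v5 anti-recovery rule when two checkpoints share
a start LSN and the object is stamped as soon as the first of them
recovers it, as happens for inodes and dquots, or for buffers once the
delwri list has been submitted:

#include <stdbool.h>
#include <stdio.h>

/* Toy model of a v5 metadata object: it remembers the LSN last stamped
 * into it on disk. */
struct toy_object {
	long	lsn;
};

/* The anti-recovery rule: skip replay if the on-disk LSN is already >=
 * the LSN of the checkpoint being recovered, otherwise apply the change
 * and stamp the checkpoint LSN into the object. */
static void toy_replay(struct toy_object *obj, long cp_lsn, const char *what)
{
	if (obj->lsn >= cp_lsn) {
		printf("skip : %s (object lsn %ld >= checkpoint lsn %ld)\n",
		       what, obj->lsn, cp_lsn);
		return;
	}
	printf("apply: %s (stamp object with lsn %ld)\n", what, cp_lsn);
	obj->lsn = cp_lsn;
}

int main(void)
{
	/* Object last written well before the recovery window. */
	struct toy_object bnobt_block = { .lsn = 10 };

	/* Checkpoints A and B both have start LSN 64 because their start
	 * records landed in the same log record. */
	toy_replay(&bnobt_block, 64, "checkpoint A frees extent X");
	toy_replay(&bnobt_block, 64, "checkpoint B reallocates extent X");
	return 0;
}

Checkpoint A's change is applied and checkpoint B's change is skipped, so
B's update never reaches disk even though it is present in the journal.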

> > > After commit 12818d24db8a ("xfs: rework log recovery to submit buffers on
> > > LSN boundaries") was introduced, we submit buffers on lsn boundaries during
> > > log recovery. 
> > 
> > Correct - we submit all the changes in a checkpoint for submission
> > before we start recovering the next checkpoint. That's because
> > checkpoints are supposed to be atomic units of change moving the
> > on-disk state from one change set to the next.
> 
> Submitting buffers on LSN boundaries does not mean submitting buffers on
> checkpoint boundaries during recovery. In my understanding, one
> transaction on disk corresponds to a checkpoint, and multiple
> transactions on disk may share the same LSN, so sometimes we should
> ensure that multiple transactions are submitted together in such a
> case.  This rule was introduced by commit
> 12818d24db8a ("xfs: rework log recovery to submit buffers on LSN boundaries")

Well, yes, that's exactly the situation that commit 12818d24db8a was
intended to handle:

    "If independent transactions share an LSN and both modify the
    same buffer, log recovery can incorrectly skip updates and leave
    the filesystem in an inconsisent state."

Unfortunately, we didn't take into account the complexity of
multiple transactions sharing the same start LSN in commit
12818d24db8a ("xfs: rework log recovery to submit buffers on LSN
boundaries") back in 2016.

Indeed, we didn't even know that there was a reliance on strict
start record LSN ordering in journal recovery until 2021:

commit 68a74dcae6737c27b524b680e070fe41f0cad43a
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Aug 10 18:00:44 2021 -0700

    xfs: order CIL checkpoint start records
    
    Because log recovery depends on strictly ordered start records as
    well as strictly ordered commit records.
    
    This is a zero day bug in the way XFS writes pipelined transactions
    to the journal which is exposed by fixing the zero day bug that
    prevents the CIL from pipelining checkpoints. This re-introduces
    explicit concurrent commits back into the on-disk journal and hence
    out of order start records.
    
    The XFS journal commit code has never ordered start records and we
    have relied on strict commit record ordering for correct recovery
    ordering of concurrently written transactions. Unfortunately, root
    cause analysis uncovered the fact that log recovery uses the LSN of
    the start record for transaction commit processing. Hence, whilst
    the commits are processed in strict order by recovery, the LSNs
    associated with the commits can be out of order and so recovery may
    stamp incorrect LSNs into objects and/or misorder intents in the AIL
    for later processing. This can result in log recovery failures
    and/or on disk corruption, sometimes silent.
    
    Because this is a long standing log recovery issue, we can't just
    fix log recovery and call it good. This still leaves older kernels
    susceptible to recovery failures and corruption when replaying a log
    from a kernel that pipelines checkpoints. There is also the issue
    that in-memory ordering for AIL pushing and data integrity
    operations are based on checkpoint start LSNs, and if the start LSN
    is incorrect in the journal, it is also incorrect in memory.
    
    Hence there's really only one choice for fixing this zero-day bug:
    we need to strictly order checkpoint start records in ascending
    sequence order in the log, the same way we already strictly order
    commit records.
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Essentially, the problem now is that even with strictly ordered
start records for checkpoints, checkpoints with the same start LSN
interfere with each other in recovery because recovery is not
aware of the fact that we can have multiple checkpoints that start
with the same LSN.

This is another zero-day issue with the journal and log recovery;
originally there was no "anti-recovery" logic in the journal like we
have now with LSNs to prevent recovery from taking metadata state
backwards.  Hence log recovery just always replayed every change
that was in the journal from start to finish and so there was never
a problem with having multiple start records in the same log record.

However, this was known to cause problems with inodes and data vs
metadata sequencing and non-transactional inode metadata updates
(e.g. inode size), so a "flush iteration" counter was added to
inodes in 2003:

commit 6ed3d868e47470a301b49f1e8626972791206f50
Author: Steve Lord <lord@sgi.com>
Date:   Wed Aug 6 21:17:05 2003 +0000

    Add versioning to the on disk inode which we increment on each
    flush call. This is used during recovery to avoid replaying an
    older copy of the inode from the log. We can do this without
    versioning the filesystem as the pad space we borrowed was
    always zero and will be ignored by old kernels.
    During recovery, do not replay an inode log record which is older
    than the on disk copy. Check for wrapping in the counter.

This was never fully reliable, and there were always issues with
this counter because inode changes weren't always journalled nor
were cache flushes used to ensure unlogged inode metadata updates
reached stable storage.

The LSN sequencing was added to the v5 format to ensure metadata
never goes backwards in time on disk without fail. The issue you've
uncovered shows that we still have issues stemming from the
original journal recovery algorithm that was not designed with
anti-recovery protections in mind from the start.

The problem we need to solve is how we preserve the necessary
anti-recovery behaviour when we have multiple checkpoints that can
have the same LSN and objects are updated immediately on recovery?

I suspect that we need to track that the checkpoint being recovered
has a duplicate start LSN (i.e. in the struct xlog_recover) and
modify the anti-recovery LSN check to take this into account. i.e.
we can really only skip recovery of the first checkpoint at any
given LSN because we cannot disambiguate an LSN updated by the first
checkpoint at that LSN and the metadata already being up to date on
disk in the second and subsequent checkpoints at the same start
LSN.
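
A minimal sketch of one reading of that idea, purely illustrative - the
same_start_lsn flag and the toy_* names are invented here and this is not
a proposed kernel change:

#include <stdbool.h>
#include <stdio.h>

typedef long long toy_lsn_t;	/* stand-in for xfs_lsn_t in this sketch */

/*
 * Hypothetical skip decision: a metadata LSN can only be trusted to mean
 * "already up to date on disk" for the first checkpoint at a given start
 * LSN. For second and subsequent checkpoints sharing that start LSN, a
 * matching metadata LSN may simply mean "stamped moments ago by the first
 * checkpoint", so replay must not be skipped.
 */
static bool toy_can_skip_replay(toy_lsn_t metadata_lsn, toy_lsn_t cp_lsn,
				bool same_start_lsn)
{
	if (same_start_lsn)
		return false;
	return metadata_lsn >= cp_lsn;
}

int main(void)
{
	/* First checkpoint at LSN 64: the normal rule applies. */
	printf("first  checkpoint, metadata lsn 64: skip=%d\n",
	       toy_can_skip_replay(64, 64, false));
	/* A later checkpoint at the same start LSN: never skip. */
	printf("second checkpoint, metadata lsn 64: skip=%d\n",
	       toy_can_skip_replay(64, 64, true));
	return 0;
}

Detecting and propagating that duplicate-start-LSN state (presumably via
struct xlog_recover, as suggested above) is where the real work would be.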

There are likely to be other solutions - anyone have a different
idea on how we might address this?

-Dave.
Brian Foster Jan. 9, 2024, 2:52 p.m. UTC | #6
On Tue, Jan 09, 2024 at 10:40:35AM +1100, Dave Chinner wrote:
> On Mon, Jan 08, 2024 at 08:28:19PM +0800, Long Li wrote:
> > Hi, Dave
> > 
> > Thanks for your reply.
> > 
> > On Mon, Jan 08, 2024 at 09:13:50AM +1100, Dave Chinner wrote:
> > > On Thu, Dec 28, 2023 at 08:46:46PM +0800, Long Li wrote:
> > > > While performing the IO fault injection test, I caught the following data
> > > > corruption report:
> > > > 
> > > >  XFS (dm-0): Internal error ltbno + ltlen > bno at line 1957 of file fs/xfs/libxfs/xfs_alloc.c.  Caller xfs_free_ag_extent+0x79c/0x1130
> > > >  CPU: 3 PID: 33 Comm: kworker/3:0 Not tainted 6.5.0-rc7-next-20230825-00001-g7f8666926889 #214
> > > >  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
> > > >  Workqueue: xfs-inodegc/dm-0 xfs_inodegc_worker
> > > >  Call Trace:
> > > >   <TASK>
> > > >   dump_stack_lvl+0x50/0x70
> > > >   xfs_corruption_error+0x134/0x150
> > > >   xfs_free_ag_extent+0x7d3/0x1130
> > > >   __xfs_free_extent+0x201/0x3c0
> > > >   xfs_trans_free_extent+0x29b/0xa10
> > > >   xfs_extent_free_finish_item+0x2a/0xb0
> > > >   xfs_defer_finish_noroll+0x8d1/0x1b40
> > > >   xfs_defer_finish+0x21/0x200
> > > >   xfs_itruncate_extents_flags+0x1cb/0x650
> > > >   xfs_free_eofblocks+0x18f/0x250
> > > >   xfs_inactive+0x485/0x570
> > > >   xfs_inodegc_worker+0x207/0x530
> > > >   process_scheduled_works+0x24a/0xe10
> > > >   worker_thread+0x5ac/0xc60
> > > >   kthread+0x2cd/0x3c0
> > > >   ret_from_fork+0x4a/0x80
> > > >   ret_from_fork_asm+0x11/0x20
> > > >   </TASK>
> > > >  XFS (dm-0): Corruption detected. Unmount and run xfs_repair
> > > > 
> > > > After analyzing the disk image, it was found that the corruption was
> > > > triggered by the fact that extent was recorded in both the inode and AGF
> > > > btrees. After a long time of reproduction and analysis, we found that the
> > > > root cause of the problem was that the AGF btree block was not recovered.
> > > 
> > > Why was it not recovered? Because of an injected IO error during
> > > recovery?
> > 
> > The reason why the buf item of the AGF btree is not recovered is that
> > the LSN of the AGF btree block is equal to the current LSN of the
> > recovery item, because log recovery skips items with a metadata LSN >=
> > the current LSN of the recovery item.
> 
> Which is the defined behaviour of the LSN sequencing fields. This is
> not an issue - the bug lies elsewhere.
> 
> > The injected IO error during recovery causes the LSN of the AGF btree
> > block to be equal to the current LSN of the recovery item,
> 
> Where is this IO error being injected? You haven't actually
> explained where in the recovery process the error occurs that
> exposes the issue.
> 
> > which happens in the
> > situation where two transactions on disk share the same LSN and both
> > modify the same buffer. Detailed information can be found below.
> 
> That is a normal situation - we can pack multiple async checkpoints
> to a single iclog, so I really  don't see how injecting IO errors is
> in any way related to creating this problem....
> 
> > > > Consider the following situation, Transaction A and Transaction B are in
> > > > the same record, so Transaction A and Transaction B share the same LSN1.
> > > > If the buf item in Transaction A has been recovered, then the buf item in
> > > > Transaction B cannot be recovered, because log recovery skips items with a
> > > > metadata LSN >= the current LSN of the recovery item.
> > > 
> > > This makes no sense to me. Transactions don't exist in the journal;
> > > they are purely in-memory constructs that are aggregated
> > > in memory (in the CIL) before being written to disk as an atomic
> > > checkpoint. Hence a log item can only appear once in a checkpoint
> > > regardless of how many transactions it is modified in memory between
> > > CIL checkpoints.
> > > 
> > > > If there is still an
> > > > inode item in transaction B that records the Extent X, the Extent X will
> > > > be recorded in both the inode and the AGF btree block after transaction B
> > > > is recovered.
> > > 
> > > That transaction should record both the addition to the inode BMBT
> > > and the removal from the AGF. Hence if transaction B is recovered in
> > > full with no errors, this should not occur.
> > > 
> > > > 
> > > >   |------------Record (LSN1)------------------|---Record (LSN2)---|
> > > >   |----------Trans A------------|-------------Trans B-------------|
> > > >   |     Buf Item(Extent X)      | Buf Item / Inode item(Extent X) |
> > > >   |     Extent X is freed       |     Extent X is allocated       |
> > > 
> > > This looks wrong. A transaction can only exist in a single CIL
> > > checkpoint and everything in a checkpoint has the same LSN. Hence we
> > > cannot have the situation where trans B spans two different
> > > checkpoints and hence span LSNs.
> > 
> > There is some misunderstanding here. The transactions I am talking
> > about are on disk, not in memory. Each transaction on disk corresponds
> > to a checkpoint (this is my understanding, or we can call it a
> > checkpoint transaction, as <<XFS Algorithms & Data Structures>> does).
> > The two are easily confused, and their meanings are not the same.
> 
> I have no idea what "<<XFS Algorithms & Data Structures>>" is. Can
> you please provide a link?
> 
> As it is, the differences between transactions in XFS and journal
> checkpointing is fully described in
> Documentation/filesystems/xfs-delayed-logging-design.rst.
> 
> Just because we used the on disk journal format structure called a
> "transaction record" to implement checkpoints for delayed logging
> does not mean that the information on disk is a transaction. This
> was an implementation convenience that avoided the need to change
> the on-disk format for delayed logging. The information held in the
> journal is a checkpoint....
> 
> > A transaction on disk can span two different records. The following
> > logs show the details:
> > 
> > //Trans A, tid d0bfef23
> > //Trans B, tid 9a76bd30 
> >
> > ============================================================================
> > cycle: 271	version: 2		lsn: 271,14642	tail_lsn: 271,12644
> > length of Log Record: 32256	prev offset: 14608		num ops: 249
> > uuid: 01ce1afc-cedd-4120-8d8d-05fbee260af9   format: little endian linux
> > h_size: 32768
> > ----------------------------------------------------------------------------
> > Oper (0): tid: d0bfef23  len: 0  clientid: TRANS  flags: START 
> > ----------------------------------------------------------------------------
> > Oper (1): tid: d0bfef23  len: 16  clientid: TRANS  flags: none
> > TRAN:     tid: d0bfef23  num_items: 145
> > ----------------------------------------------------------------------------
> 
> Yup, that is the start of checkpoint A, at LSN 271,14642.
> 
> > 	......
> > ----------------------------------------------------------------------------
> > Oper (102): tid: d0bfef23  len: 24  clientid: TRANS  flags: none
> > BUF:  #regs: 2   start blkno: 1048577 (0x100001)  len: 1  bmap size: 1  flags: 0x2800
> > Oper (103): tid: d0bfef23  len: 128  clientid: TRANS  flags: none
> > AGF Buffer: XAGF  
> > ver: 1  seq#: 2  len: 65536  
> > root BNO: 3  CNT: 4
> > level BNO: 1  CNT: 1
> > 1st: 110  last: 113  cnt: 4  freeblks: 40923  longest: 37466
> > ----------------------------------------------------------------------------
> > Oper (104): tid: d0bfef23  len: 24  clientid: TRANS  flags: none
> > BUF:  #regs: 2   start blkno: 1048600 (0x100018)  len: 8  bmap size: 1  flags: 0x2000
> > Oper (105): tid: d0bfef23  len: 768  clientid: TRANS  flags: none
> > BUF DATA
> >  0 42334241 4a000000 ffffffff ffffffff        0 18001000  f010000 ff300000 
>      ^^^^^^^^^						    ^^^^^^^^^^^^^^^^
>      magic						    lsn
> 
> #define XFS_ABTB_CRC_MAGIC      0x41423342      /* 'AB3B' */
> 
> and the LSN is 0x10f000030ff or (271,12543)
> 
> >  8 fc1ace01 2041ddce fb058d8d f90a26ee  2000000 128ec3ad  50a0000 5d000000 
> > 10 770a0000 21000000 a00a0000 b5000000 580b0000 2d010000 fa0c0000 10000000 
> > 	
> > /* extent (770a0000 21000000) recorded in the AGF */
> 
> It's been recorded in the by-blocknumber freespace btree block, not
> the AGF.
> 
> > 	......
> > ----------------------------------------------------------------------------
> > Oper (147): tid: d0bfef23  len: 0  clientid: TRANS  flags: COMMIT 
> > ----------------------------------------------------------------------------
> 
> And there is the end of checkpoint A. It gets recovered when this
> COMMIT ophdr is processed and the buffer list gets submitted because
> log->l_recovery_lsn != trans->r_lsn.
> 
> The problem then arises here:
> 
> 
> > Oper (148): tid: 9a76bd30  len: 0  clientid: TRANS  flags: START 
> > ----------------------------------------------------------------------------
> > Oper (149): tid: 9a76bd30  len: 16  clientid: TRANS  flags: none
> > TRAN:     tid: 9a76bd30  num_items: 164
> > ----------------------------------------------------------------------------
> 
> At the start of checkpoint B. The START ophdr was formatted into the
> same iclog as the start/finish of checkpoint A, which means we have
> two checkpoints being recovered with the same rhead->h_lsn value.
> But we've just submitted the buffer list for checkpoint A, so all
> the buffers modified in that checkpoint are going to be stamped with
> trans->r_lsn.
> 
> And because checkpoint B has the same trans->r_lsn value as
> checkpoint A (because they both start in the same record), then
> recovery of anything modified in both checkpoint A and checkpoint B
> is going to be skipped.
> 
> This is not confined to buffers - these are just the messenger. The
> same will happen with inodes, dquots and any other metadata object
> that is modified in both transactions: the second modification will
> be skipped because the LSN in the metadata object matches the
> trans->r_lsn value of the second checkpoint.
> 
> This means the problem is not isolated to the delayed write buffer
> list - any inode we recover immediately writes the trans->r_lsn into
> the inode and then recalculates the CRC. Sure, the buffer is then
> added to the delwri list, but the inode LSN in the buffer has
> already been updated before the buffer is placed on the delwri list.
> Same goes for dquot recovery.
> 
> > > > After commit 12818d24db8a ("xfs: rework log recovery to submit buffers on
> > > > LSN boundaries") was introduced, we submit buffers on lsn boundaries during
> > > > log recovery. 
> > > 
> > > Correct - we submit all the changes in a checkpoint for submission
> > > before we start recovering the next checkpoint. That's because
> > > checkpoints are supposed to be atomic units of change moving the
> > > on-disk state from one change set to the next.
> > 
> > Submitting buffers on LSN boundaries does not mean submitting buffers
> > on checkpoint boundaries during recovery. In my understanding, one
> > transaction on disk corresponds to a checkpoint, and multiple
> > transactions on disk may share the same LSN, so sometimes we should
> > ensure that multiple transactions are submitted together in such a
> > case.  This rule was introduced by commit
> > 12818d24db8a ("xfs: rework log recovery to submit buffers on LSN boundaries")
> 
> Well, yes, that's exactly the situation that commit 12818d24db8a was
> intended to handle:
> 
>     "If independent transactions share an LSN and both modify the
>     same buffer, log recovery can incorrectly skip updates and leave
>     the filesystem in an inconsisent state."
> 
> Unfortunately, we didn't take into account the complexity of
> multiple transactions sharing the same start LSN in commit
> 12818d24db8a ("xfs: rework log recovery to submit buffers on LSN
> boundaries") back in 2016.
> 
> Indeed, we didn't even know that there was a reliance on strict
> start record LSN ordering in journal recovery until 2021:
> 
> commit 68a74dcae6737c27b524b680e070fe41f0cad43a
> Author: Dave Chinner <dchinner@redhat.com>
> Date:   Tue Aug 10 18:00:44 2021 -0700
> 
>     xfs: order CIL checkpoint start records
>     
>     Because log recovery depends on strictly ordered start records as
>     well as strictly ordered commit records.
>     
>     This is a zero day bug in the way XFS writes pipelined transactions
>     to the journal which is exposed by fixing the zero day bug that
>     prevents the CIL from pipelining checkpoints. This re-introduces
>     explicit concurrent commits back into the on-disk journal and hence
>     out of order start records.
>     
>     The XFS journal commit code has never ordered start records and we
>     have relied on strict commit record ordering for correct recovery
>     ordering of concurrently written transactions. Unfortunately, root
>     cause analysis uncovered the fact that log recovery uses the LSN of
>     the start record for transaction commit processing. Hence, whilst
>     the commits are processed in strict order by recovery, the LSNs
>     associated with the commits can be out of order and so recovery may
>     stamp incorrect LSNs into objects and/or misorder intents in the AIL
>     for later processing. This can result in log recovery failures
>     and/or on disk corruption, sometimes silent.
>     
>     Because this is a long standing log recovery issue, we can't just
>     fix log recovery and call it good. This still leaves older kernels
>     susceptible to recovery failures and corruption when replaying a log
>     from a kernel that pipelines checkpoints. There is also the issue
>     that in-memory ordering for AIL pushing and data integrity
>     operations are based on checkpoint start LSNs, and if the start LSN
>     is incorrect in the journal, it is also incorrect in memory.
>     
>     Hence there's really only one choice for fixing this zero-day bug:
>     we need to strictly order checkpoint start records in ascending
>     sequence order in the log, the same way we already strictly order
>     commit records.
>     
>     Signed-off-by: Dave Chinner <dchinner@redhat.com>
>     Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>     Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> 
> Essentially, the problem now is that even with strictly ordered
> start records for checkpoints, checkpoints with the same start LSN
> interfere with each other in recovery because recovery is not
> aware of the fact that we can have multiple checkpoints that start
> with the same LSN.
> 
> This is another zero-day issue with the journal and log recovery;
> originally there was no "anti-recovery" logic in the journal like we
> have now with LSNs to prevent recovery from taking metadata state
> backwards.  Hence log recovery just always replayed every change
> that was in the journal from start to finish and so there was never
> a problem with having multiple start records in the same log record.
> 
> However, this was known to cause problems with inodes and data vs
> metadata sequencing and non-transactional inode metadata updates
> (e.g. inode size), so a "flush iteration" counter was added to
> inodes in 2003:
> 
> commit 6ed3d868e47470a301b49f1e8626972791206f50
> Author: Steve Lord <lord@sgi.com>
> Date:   Wed Aug 6 21:17:05 2003 +0000
> 
>     Add versioning to the on disk inode which we increment on each
>     flush call. This is used during recovery to avoid replaying an
>     older copy of the inode from the log. We can do this without
>     versioning the filesystem as the pad space we borrowed was
>     always zero and will be ignored by old kernels.
>     During recovery, do not replay an inode log record which is older
>     than the on disk copy. Check for wrapping in the counter.
> 
> This was never fully reliable, and there were always issues with
> this counter because inode changes weren't always journalled nor
> were cache flushes used to ensure unlogged inode metadata updates
> reached stable storage.
> 
> The LSN sequencing was added to the v5 format to ensure metadata
> never goes backwards in time on disk without fail. The issue you've
> uncovered shows that we still have issues stemming from the
> original journal recovery algorithm that was not designed with
> anti-recovery protections in mind from the start.
> 
> The problem we need to solve is how we preserve the necessary
> anti-recovery behaviour when we have multiple checkpoints that can
> have the same LSN and objects are updated immediately on recovery?
> 
> I suspect that we need to track that the checkpoint being recovered
> has a duplicate start LSN (i.e. in the struct xlog_recover) and
> modify the anti-recovery LSN check to take this into account. i.e.
> we can really only skip recovery of the first checkpoint at any
> given LSN because we cannot disambiguate an LSN updated by the first
> checkpoint at that LSN and the metadata already being up to date on
> disk in the second and subsequent checkpoints at the same start
> LSN.
> 
> There are likely to be other solutions - anyone have a different
> idea on how we might address this?
> 

It's been a while since I've looked at any of this and I haven't waded
through all of the details, so I could easily be missing something, but
what exactly is wrong with the approach of the patch as posted?

Commit 12818d24db ("xfs: rework log recovery to submit buffers on LSN
boundaries") basically created a new invariant for log recovery where
buffers are allowed to be written only once per LSN. The risk otherwise
is that a subsequent update with a matching LSN would not be correctly
applied due to the v5 LSN ordering rules. Since log recovery processes
transactions (using terminology/granularity as defined by the
implementation of xlog_recover_commit_trans()), this required changes to
accommodate any of the various possible runtime logging scenarios that
could cause a buffer to have multiple entries in the log associated with
a single LSN, the details of which were orthogonal to the fix.

The functional change therefore was that rather than to process and
submit "transactions" in sequence during recovery, the pending buffer
list was lifted to a higher level in the code, a tracking field was
added for the "current LSN" of log recovery, and only once we cross a
current LSN boundary are we allowed to submit the set of buffers
processed for the prior LSN. The reason for this logic is that seeing
the next LSN was really the only way we know we're done processing items
for a particular LSN.

If I understand the problem description correctly, the issue here is
that if an error is encountered in the middle of processing items for
some LSN A, we bail out of recovery and submit the pending buffers on
the way out. If we haven't completed processing all items for LSN A
before failing, however, then we've just possibly violated the "write
once per LSN" invariant that protects from corrupting the fs. This is
because the writeback permanently updates metadata LSNs (assuming that
I/O doesn't fail), which means if recovery retries from the same point
the next time around and progresses to find a second instance of an
already written buffer in LSN A, it will exhibit the same general
behavior from before the write once invariant was introduced. IOW,
there's still a vector to the original problematic multi-write per LSN
behavior through multiple recovery attempts (hence the simulated I/O
error to reproduce).
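
To illustrate that multi-mount vector, here is a toy user-space sketch
(not kernel code; the LSN values and names are made up) of two recovery
attempts over the same buffer:

#include <stdio.h>

/* LSN currently stamped into one buffer on disk. */
static long disk_lsn = 10;

/* Replay one logged change to that buffer under the v5 skip rule. */
static void toy_replay(long cp_lsn, const char *what)
{
	if (disk_lsn >= cp_lsn) {
		printf("  skip : %s\n", what);
		return;
	}
	printf("  apply: %s\n", what);
}

int main(void)
{
	/* Mount 1: the first change at LSN A is replayed, then recovery
	 * fails before reaching the second change. The error path still
	 * submits the pending delwri buffer, stamping LSN A on disk. */
	printf("first recovery attempt (fails before the second change):\n");
	toy_replay(64, "LSN A, first change to the buffer");
	disk_lsn = 64;		/* delwri submit on the way out */

	/* Mount 2: recovery restarts from the same point, but the buffer
	 * already carries LSN A, so the change that was never applied is
	 * now silently skipped as well. */
	printf("second recovery attempt:\n");
	toy_replay(64, "LSN A, first change to the buffer");
	toy_replay(64, "LSN A, second change to the buffer");
	return 0;
}

The second change is lost on the retry, which matches the behaviour
described above.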

Long Li, am I following the problem description correctly? I've not
fully reviewed it, but if so, the proposed solution seems fairly sane
and logical to me. (And nice work tracking this down, BTW, regardless of
whether this is the final solution. ;).

Brian

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner Jan. 9, 2024, 9:43 p.m. UTC | #7
On Tue, Jan 09, 2024 at 09:52:37AM -0500, Brian Foster wrote:
> > 
> > The problem we need to solve is how we preserve the necessary
> > anti-recovery behaviour when we have multiple checkpoints that can
> > have the same LSN and objects are updated immediately on recovery?
> > 
> > I suspect that we need to track that the checkpoint being recovered
> > has a duplicate start LSN (i.e. in the struct xlog_recover) and
> > modify the anti-recovery LSN check to take this into account. i.e.
> > we can really only skip recovery of the first checkpoint at any
> > given LSN because we cannot disambiguate an LSN updated by the first
> > checkpoint at that LSN and the metadata already being up to date on
> > disk in the second and subsequent checkpoints at the same start
> > LSN.
> > 
> > There are likely to be other solutions - anyone have a different
> > idea on how we might address this?
> > 
> 
> It's been a while since I've looked at any of this and I haven't waded
> through all of the details, so I could easily be missing something, but
> what exactly is wrong with the approach of the patch as posted?

That it fails to address the fact that the code as implemented
violates the "only submit buffers on LSN change" invariant. Hence we
have a silent failure to recover the second set of changes
to a log item recorded in the multiple checkpoints that have the
same start LSN.

The original problem described in the commit - a shutdown due to a
freespace btree record corruption - has been something we've seen
semi-regularly for a few years now. We've never got to the
bottom of the problem because we've lacked a reliable reproducer for
the issue.

The analysis and debug information provided by Long indicates
that when multiple checkpoints start at the same LSN, the objects in
the later checkpoints (based on commit record ordering) won't get
replayed because the LSN in the object has already been updated by
the first checkpoint. Hence they skip recovery in the second (and
subsequent) checkpoints at the same start LSN.

In a lot of these cases, the object will be logged again later in
the recovery process, thereby overwriting the corruption caused by
skipping a checkpointed update. Hence this will only be exposed in
normal situations if the silent recovery failure occurs on the last
modification of the object in the journal.

This is why it's a rare failure to be seen in production systems,
but it is something that hindsight tells us has been occurring given
the repeated reports of unexplainable single record free space btree
corruption we've had over the past few years.

> Commit 12818d24db ("xfs: rework log recovery to submit buffers on LSN
> boundaries") basically created a new invariant for log recovery where
> buffers are allowed to be written only once per LSN. The risk otherwise
> is that a subsequent update with a matching LSN would not be correctly
> applied due to the v5 LSN ordering rules. Since log recovery processes
> transactions (using terminology/granularity as defined by the
> implementation of xlog_recover_commit_trans()), this required changes to
> accommodate any of the various possible runtime logging scenarios that
> could cause a buffer to have multiple entries in the log associated with
> a single LSN, the details of which were orthogonal to the fix.
> 
> The functional change therefore was that rather than to process and
> submit "transactions" in sequence during recovery, the pending buffer
> list was lifted to a higher level in the code, a tracking field was
> added for the "current LSN" of log recovery, and only once we cross a
> current LSN boundary are we allowed to submit the set of buffers
> processed for the prior LSN. The reason for this logic is that seeing
> the next LSN was really the only way we know we're done processing items
> for a particular LSN.

Yes, and therein lies one of the problems with the current
implementation - this "lsn has changed" logic is incorrect.

> If I understand the problem description correctly, the issue here is
> that if an error is encountered in the middle of processing items for
> some LSN A, we bail out of recovery and submit the pending buffers on
> the way out.  If we haven't completed processing all items for LSN A
> before failing, however, then we've just possibly violated the "write
> once per LSN" invariant that protects from corrupting the fs.

The error handling and/or repeated runs of log recovery simply
exposes the problem - these symptoms are not the problem that needs
to be fixed.

The issue is that the code as it stands doesn't handle object
recovery from multiple checkpoints with the same start lsn. The
easiest way to understand this is to look at the buffer submit logic
on completion of a checkpoint:

	if (log->l_recovery_lsn != trans->r_lsn &&
            ohead->oh_flags & XLOG_COMMIT_TRANS) {
                error = xfs_buf_delwri_submit(buffer_list);
                if (error)
                        return error;
                log->l_recovery_lsn = trans->r_lsn;
        }

This submits the buffer list on the first checkpoint that completes
with a new start LSN, not when all the checkpoints with the same
start LSN complete. i.e.:

checkpoint  start LSN	commit lsn	submission on commit record
A		32	  63		buffer list for A
B		64	  68		buffer list for B
C		64	  92		nothing, start lsn unchanged
D		64	 127		nothing, start lsn unchanged
E		128	 192		buffer list for C, D and E

IOWs, the invariant "don't submit buffers until LSN changes" is not
actually implemented correctly by this code. This is the obvious
aspect of the problem, but addressing buffer submission doesn't
actually fix the problem.

That is, changing buffer submission to be correct doesn't address
the fact that we've already done things like updated the LSN in
inodes and dquots during recovery of those objects. Hence,
regardless of whether we submit the buffers or not, changes to
non-buffer objects in checkpoints C and D will never get recovered
directly if they were originally modified in checkpoint B.

This is the problem we need to address: if we have multiple
checkpoints at the same start LSN, we need to ensure that all the
changes to any object in any of the checkpoints at that start LSN
are recovered. This is what we are not doing, and this is the root
cause of the problem....

-Dave.
Long Li Jan. 10, 2024, 7:03 a.m. UTC | #8
On Wed, Jan 10, 2024 at 08:43:31AM +1100, Dave Chinner wrote:
> On Tue, Jan 09, 2024 at 09:52:37AM -0500, Brian Foster wrote:
> > > 
> > > The problem we need to solve is how we preserve the necessary
> > > anti-recovery behaviour when we have multiple checkpoints that can
> > > have the same LSN and objects are updated immediately on recovery?
> > > 
> > > I suspect that we need to track that the checkpoint being recovered
> > > has a duplicate start LSN (i.e. in the struct xlog_recover) and
> > > modify the anti-recovery LSN check to take this into account. i.e.
> > > we can really only skip recovery of the first checkpoint at any
> > > given LSN because we cannot disambiguate an LSN updated by the first
> > > checkpoint at that LSN and the metadata already being up to date on
> > > disk in the second and subsequent checkpoints at the same start
> > > LSN.
> > > 
> > > There are likely to be other solutions - anyone have a different
> > > idea on how we might address this?
> > > 
> > 
> > It's been a while since I've looked at any of this and I haven't waded
> > through all of the details, so I could easily be missing something, but
> > what exactly is wrong with the approach of the patch as posted?
> 
> That it fails to address the fact that the code as implemented
> violates the "only submit buffers on LSN change" invariant. Hence we
> have a silent failure to recover the second set of changes
> to a log item recorded in the multiple checkpoints that have the
> same start LSN.
> 
> The original problem described in the commit - a shutdown due to a
> freespace btree record corruption - has been something we've seen
> semi-regularly for a few years now. We've never got to the
> bottom of the problem because we've lacked a reliable reproducer for
> the issue.
> 
> The analysis and debug information provided by Long indicates
> that when multiple checkpoints start at the same LSN, the objects in
> the later checkpoints (based on commit record ordering) won't get
> replayed because the LSN in the object has already been updated by
> the first checkpoint. Hence they skip recovery in the second (and
> subsequent) checkpoints at the same start LSN.
> 
> In a lot of these cases, the object will be logged again later in
> the recovery process, thereby overwriting the corruption caused by
> skipping a checkpointed update. Hence this will only be exposed in
> normal situations if the silent recovery failure occurs on the last
> modification of the object in the journal.
> 
> This is why it's a rare failure to be seen in production systems,
> but it is something that hindsight tells us has been occurring given
> the repeated reports of unexplainable single record free space btree
> corruption we've had over the past few years.
> 
> > Commit 12818d24db ("xfs: rework log recovery to submit buffers on LSN
> > boundaries") basically created a new invariant for log recovery where
> > buffers are allowed to be written only once per LSN. The risk otherwise
> > is that a subsequent update with a matching LSN would not be correctly
> > applied due to the v5 LSN ordering rules. Since log recovery processes
> > transactions (using terminology/granularity as defined by the
> > implementation of xlog_recover_commit_trans()), this required changes to
> > accommodate any of the various possible runtime logging scenarios that
> > could cause a buffer to have multiple entries in the log associated with
> > a single LSN, the details of which were orthogonal to the fix.
> > 
> > The functional change therefore was that rather than to process and
> > submit "transactions" in sequence during recovery, the pending buffer
> > list was lifted to a higher level in the code, a tracking field was
> > added for the "current LSN" of log recovery, and only once we cross a
> > current LSN boundary are we allowed to submit the set of buffers
> > processed for the prior LSN. The reason for this logic is that seeing
> > the next LSN was really the only way we know we're done processing items
> > for a particular LSN.
> 
> Yes, and therein lies one of the problems with the current
> implementation - this "lsn has changed" logic is incorrect.
> 
> > If I understand the problem description correctly, the issue here is
> > that if an error is encountered in the middle of processing items for
> > some LSN A, we bail out of recovery and submit the pending buffers on
> > the way out.  If we haven't completed processing all items for LSN A
> > before failing, however, then we've just possibly violated the "write
> > once per LSN" invariant that protects from corrupting the fs.
> 
> The error handling and/or repeated runs of log recovery simply
> exposes the problem - these symptoms are not the problem that needs
> to be fixed.
> 
> The issue is that the code as it stands doesn't handle object
> recovery from multiple checkpoints with the same start lsn. The
> easiest way to understand this is to look at the buffer submit logic
> on completion of a checkpoint:
> 
> 	if (log->l_recovery_lsn != trans->r_lsn &&
>             ohead->oh_flags & XLOG_COMMIT_TRANS) {
>                 error = xfs_buf_delwri_submit(buffer_list);
>                 if (error)
>                         return error;
>                 log->l_recovery_lsn = trans->r_lsn;
>         }
> 
> This submits the buffer list on the first checkpoint that completes
> with a new start LSN, not when all the checkpoints with the same
> start LSN complete. i.e.:
> 
> checkpoint  start LSN	commit lsn	submission on commit record
> A		32	  63		buffer list for A
> B		64	  68		buffer list for B
> C		64	  92		nothing, start lsn unchanged
> D		64	 127		nothing, start lsn unchanged
> E		128	 192		buffer list for C, D and E
> 

I have a different understanding of this code. When the first checkpoint's
commit record is handled, buffer_list is empty and l_recovery_lsn is updated
to the first checkpoint's LSN. The result is that each checkpoint's submit
logic tries to submit the buffers that were added to the buffer list during
checkpoint recovery of the previous LSN.

  xlog_do_recovery_pass
    LIST_HEAD (buffer_list);
    xlog_recover_process
      xlog_recover_process_data
        xlog_recover_process_ophdr
          xlog_recovery_process_trans
            if (log->l_recovery_lsn != trans->r_lsn &&
                ohead->oh_flags & XLOG_COMMIT_TRANS) { 
              xfs_buf_delwri_submit(buffer_list); //submit buffer list
              log->l_recovery_lsn = trans->r_lsn;
            }
            xlog_recovery_process_trans
              xlog_recover_commit_trans
                xlog_recover_items_pass2
                  item->ri_ops->commit_pass2
                    xlog_recover_buf_commit_pass2
                      xfs_buf_delwri_queue(bp, buffer_list) //add bp to buffer list
    if (!list_empty(&buffer_list)) 
      /* submit buffers that was added in checkpoint recovery of last LSN */
      xfs_buf_delwri_submit(&buffer_list)

So, I think it should be:
    
checkpoint  start LSN	commit lsn	submission on commit record
A		32	  63		nothing, buffer list is empty
B		64	  68		buffer list for A
C		64	  92		nothing, start lsn unchanged
D		64	 127		nothing, start lsn unchanged
E		128	 192		buffer list for B, C and D

Thanks,
Long Li


> IOWs, the invariant "don't submit buffers until LSN changes" is not
> actually implemented correctly by this code. This is the obvious
> aspect of the problem, but addressing buffer submission doesn't
> actually fix the problem.
> 
> That is, changing buffer submission to be correct doesn't address
> the fact that we've already done things like updated the LSN in
> inodes and dquots during recovery of those objects. Hence,
> regardless of whether we submit the buffers or not, changes to
> non-buffer objects in checkpoints C and D will never get recovered
> directly if they were originally modified in checkpoint B.
> 
> This is the problem we need to address: if we have multiple
> checkpoints at the same start LSN, we need to ensure that all the
> changes to any object in any of the checkpoints at that start LSN
> are recovered. This is what we are not doing, and this is the root
> cause of the problem....
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Long Li Jan. 10, 2024, 8:38 a.m. UTC | #9
On Tue, Jan 09, 2024 at 09:52:37AM -0500, Brian Foster wrote:
> > > > > After commit 12818d24db8a ("xfs: rework log recovery to submit buffers on
> > > > > LSN boundaries") was introduced, we submit buffers on lsn boundaries during
> > > > > log recovery. 
> > > > 
> > > > Correct - we submit all the changes in a checkpoint for submission
> > > > before we start recovering the next checkpoint. That's because
> > > > checkpoints are supposed to be atomic units of change moving the
> > > > on-disk state from one change set to the next.
> > > 
> > > Submitting buffers on LSN boundaries does not mean submitting buffers
> > > on checkpoint boundaries during recovery. In my understanding, one
> > > transaction on disk corresponds to a checkpoint, and multiple
> > > transactions on disk may share the same LSN, so sometimes we should
> > > ensure that multiple transactions are submitted together in such a
> > > case.  This rule was introduced by commit
> > > 12818d24db8a ("xfs: rework log recovery to submit buffers on LSN boundaries")
> > 
> > Well, yes, that's exactly the situation that commit 12818d24db8a was
> > intended to handle:
> > 
> >     "If independent transactions share an LSN and both modify the
> >     same buffer, log recovery can incorrectly skip updates and leave
> >     the filesystem in an inconsisent state."
> > 
> > Unfortunately, we didn't take into account the complexity of
> > mutliple transactions sharing the same start LSN in commit
> > 12818d24db8a ("xfs: rework log recovery to submit buffers on LSN
> > boundaries") back in 2016.
> > 
> > Indeed, we didn't even know that there was a reliance on strict
> > start record LSN ordering in journal recovery until 2021:
> > 
> > commit 68a74dcae6737c27b524b680e070fe41f0cad43a
> > Author: Dave Chinner <dchinner@redhat.com>
> > Date:   Tue Aug 10 18:00:44 2021 -0700
> > 
> >     xfs: order CIL checkpoint start records
> >     
> >     Because log recovery depends on strictly ordered start records as
> >     well as strictly ordered commit records.
> >     
> >     This is a zero day bug in the way XFS writes pipelined transactions
> >     to the journal which is exposed by fixing the zero day bug that
> >     prevents the CIL from pipelining checkpoints. This re-introduces
> >     explicit concurrent commits back into the on-disk journal and hence
> >     out of order start records.
> >     
> >     The XFS journal commit code has never ordered start records and we
> >     have relied on strict commit record ordering for correct recovery
> >     ordering of concurrently written transactions. Unfortunately, root
> >     cause analysis uncovered the fact that log recovery uses the LSN of
> >     the start record for transaction commit processing. Hence, whilst
> >     the commits are processed in strict order by recovery, the LSNs
> >     associated with the commits can be out of order and so recovery may
> >     stamp incorrect LSNs into objects and/or misorder intents in the AIL
> >     for later processing. This can result in log recovery failures
> >     and/or on disk corruption, sometimes silent.
> >     
> >     Because this is a long standing log recovery issue, we can't just
> >     fix log recovery and call it good. This still leaves older kernels
> >     susceptible to recovery failures and corruption when replaying a log
> >     from a kernel that pipelines checkpoints. There is also the issue
> >     that in-memory ordering for AIL pushing and data integrity
> >     operations are based on checkpoint start LSNs, and if the start LSN
> >     is incorrect in the journal, it is also incorrect in memory.
> >     
> >     Hence there's really only one choice for fixing this zero-day bug:
> >     we need to strictly order checkpoint start records in ascending
> >     sequence order in the log, the same way we already strictly order
> >     commit records.
> >     
> >     Signed-off-by: Dave Chinner <dchinner@redhat.com>
> >     Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> >     Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > 
> > Essentially, the problem now is that even with strictly ordered
> > start records for checkpoints, checkpoints with the same start LSN
> > interfere with each other in recovery because recovery is not
> > aware of the fact that we can have multiple checkpoints that start
> > with the same LSN.
> > 
> > This is another zero-day issue with the journal and log recovery;
> > originally there was no "anti-recovery" logic in the journal like we
> > have now with LSNs to prevent recovery from taking metadata state
> > backwards.  Hence log recovery just always replayed every change
> > that was in the journal from start to finish and so there was never
> > a problem with having multiple start records in the same log record.
> > 
> > However, this was known to cause problems with inodes and data vs
> > metadata sequencing and non-transactional inode metadata updates
> > (e.g. inode size), so a "flush iteration" counter was added to
> > inodes in 2003:
> > 
> > commit 6ed3d868e47470a301b49f1e8626972791206f50
> > Author: Steve Lord <lord@sgi.com>
> > Date:   Wed Aug 6 21:17:05 2003 +0000
> > 
> >     Add versioning to the on disk inode which we increment on each
> >     flush call. This is used during recovery to avoid replaying an
> >     older copy of the inode from the log. We can do this without
> >     versioning the filesystem as the pad space we borrowed was
> >     always zero and will be ignored by old kernels.
> >     During recovery, do not replay an inode log record which is older
> >     than the on disk copy. Check for wrapping in the counter.
> > 
> > This was never fully reliable, and there were always issues with
> > this counter because inode changes weren't always journalled nor
> > were cache flushes used to ensure unlogged inode metadata updates
> > reached stable storage.
> > 
> > The LSN sequencing was added to the v5 format to ensure metadata
> > never goes backwards in time on disk without fail. The issue you've
> > uncovered shows that we still have issues stemming from the
> > original journal recovery algorithm that was not designed with
> > anti-recovery protections in mind from the start.
> > 
> > The problem we need to solve is how we preserve the necessary
> > anti-recovery behaviour when we have multiple checkpoints that can
> > have the same LSN and objects are updated immediately on recovery?
> > 
> > I suspect that we need to track that the checkpoint being recovered
> > has a duplicate start LSN (i.e. in the struct xlog_recover) and
> > modify the anti-recovery LSN check to take this into account. i.e.
> > we can really only skip recovery of the first checkpoint at any
> > given LSN because we cannot disambiguate an LSN updated by the first
> > checkpoint at that LSN and the metadata already being up to date on
> > disk in the second and subsequent checkpoints at the same start
> > LSN.
> > 
> > There are likely to be other solutions - anyone have a different
> > idea on how we might address this?
> > 
> 
> It's been a while since I've looked at any of this and I haven't waded
> through all of the details, so I could easily be missing something, but
> what exactly is wrong with the approach of the patch as posted?
> 
> Commit 12818d24db ("xfs: rework log recovery to submit buffers on LSN
> boundaries") basically created a new invariant for log recovery where
> buffers are allowed to be written only once per LSN. The risk otherwise
> is that a subsequent update with a matching LSN would not be correctly
> applied due to the v5 LSN ordering rules. Since log recovery processes
> transactions (using terminology/granularity as defined by the
> implementation of xlog_recover_commit_trans()), this required changes to
> accommodate any of the various possible runtime logging scenarios that
> could cause a buffer to have multiple entries in the log associated with
> a single LSN, the details of which were orthogonal to the fix.
> 
> The functional change therefore was that rather than to process and
> submit "transactions" in sequence during recovery, the pending buffer
> list was lifted to a higher level in the code, a tracking field was
> added for the "current LSN" of log recovery, and only once we cross a
> current LSN boundary are we allowed to submit the set of buffers
> processed for the prior LSN. The reason for this logic is that seeing
> the next LSN was really the only way we know we're done processing items
> for a particular LSN.
> 
> If I understand the problem description correctly, the issue here is
> that if an error is encountered in the middle of processing items for
> some LSN A, we bail out of recovery and submit the pending buffers on
> the way out. If we haven't completed processing all items for LSN A
> before failing, however, then we've just possibly violated the "write
> once per LSN" invariant that protects from corrupting the fs. This is
> because the writeback permanently updates metadata LSNs (assuming that
> I/O doesn't fail), which means if recovery retries from the same point
> the next time around and progresses to find a second instance of an
> already written buffer in LSN A, it will exhibit the same general
> behavior from before the write once invariant was introduced. IOW,
> there's still a vector to the original problematic multi-write per LSN
> behavior through multiple recovery attempts (hence the simulated I/O
> error to reproduce).
> 
> Long Li, am I following the problem description correctly? I've not
> fully reviewed it, but if so, the proposed solution seems fairly sane
> and logical to me. (And nice work tracking this down, BTW, regardless of
> whether this is the final solution. ;).
> 

Hi Brian, your description matches my understanding, and it is clear and
easy to follow. Thanks for the encouragement of my work.

Thanks,
Long Li
Dave Chinner Jan. 10, 2024, 11:08 p.m. UTC | #10
On Wed, Jan 10, 2024 at 03:03:24PM +0800, Long Li wrote:
> On Wed, Jan 10, 2024 at 08:43:31AM +1100, Dave Chinner wrote:
> > The issue is that the code as it stands doesn't handle object
> > recovery from multiple checkpoints with the same start lsn. The
> > easiest way to understand this is to look at the buffer submit logic
> > on completion of a checkpoint:
> > 
> > 	if (log->l_recovery_lsn != trans->r_lsn &&
> >             ohead->oh_flags & XLOG_COMMIT_TRANS) {
> >                 error = xfs_buf_delwri_submit(buffer_list);
> >                 if (error)
> >                         return error;
> >                 log->l_recovery_lsn = trans->r_lsn;
> >         }
> > 
> > This submits the buffer list on the first checkpoint that completes
> > with a new start LSN, not when all the checkpoints with the same
> > start LSN complete. i.e.:
> > 
> > checkpoint  start LSN	commit lsn	submission on commit record
> > A		32	  63		buffer list for A
> > B		64	  68		buffer list for B
> > C		64	  92		nothing, start lsn unchanged
> > D		64	 127		nothing, start lsn unchanged
> > E		128	 192		buffer list for C, D and E
> > 
> 
> I have a different understanding of this code. In the first checkpoint's
> handling of the commit record, buffer_list is empty and l_recovery_lsn is
> updated to the first checkpoint's lsn, so the result is that each
> checkpoint's submit logic tries to submit the buffers which were added to
> the buffer list during checkpoint recovery of the previous LSN.
> 
>   xlog_do_recovery_pass
>     LIST_HEAD (buffer_list);
>     xlog_recover_process
>       xlog_recover_process_data
>         xlog_recover_process_ophdr
>           xlog_recovery_process_trans
>             if (log->l_recovery_lsn != trans->r_lsn &&
>                 ohead->oh_flags & XLOG_COMMIT_TRANS) { 
>               xfs_buf_delwri_submit(buffer_list); //submit buffer list
>               log->l_recovery_lsn = trans->r_lsn;
>             }
>             xlog_recovery_process_trans
>               xlog_recover_commit_trans
>                 xlog_recover_items_pass2
>                   item->ri_ops->commit_pass2
>                     xlog_recover_buf_commit_pass2
>                       xfs_buf_delwri_queue(bp, buffer_list) //add bp to buffer list
>     if (!list_empty(&buffer_list)) 
>       /* submit buffers that was added in checkpoint recovery of last LSN */
>       xfs_buf_delwri_submit(&buffer_list)
> 
> So, I think it should be:
>     
> checkpoint  start LSN	commit lsn	submission on commit record
> A		32	  63		nothing, buffer list is empty
> B		64	  68		buffer list for A
> C		64	  92		nothing, start lsn unchanged
> D		64	 127		nothing, start lsn unchanged
> E		128	 192		buffer list for B, C and D

You are right, I made a mistake in determining the order of buffer
list submission vs checkpoint recovery that builds a given buffer
list. Mistakes happen when you only look at complex code once every
few years. I will go back and look at the original patch again with
this in mind.

-Dave.
Dave Chinner Jan. 11, 2024, 12:47 a.m. UTC | #11
On Thu, Dec 28, 2023 at 08:46:46PM +0800, Long Li wrote:
> In order to make sure that buffers are submitted on LSN boundaries in the
> abnormal paths, we need to check the error status before submitting buffers
> that have been added from the last record processed. If an error status
> exists, the buffers in the buffer_list should be canceled.
> 
> Canceling the buffers in the buffer_list directly isn't correct: unlike
> any other place where the write list is canceled, these buffers have been
> initialized by xfs_buf_item_init() during recovery and are held by the buf
> item, and the buf items will not be released in xfs_buf_delwri_cancel(). If
> these buffers are submitted successfully, the buf items associated with
> the buffers will be released at IO completion. So releasing the buf items
> in the write list canceling process is needed.

I still don't think this is correct.

> Fixes: 50d5c8d8e938 ("xfs: check LSN ordering for v5 superblocks during recovery")
> Signed-off-by: Long Li <leo.lilong@huawei.com>
> ---
>  fs/xfs/xfs_buf.c         |  2 ++
>  fs/xfs/xfs_log_recover.c | 22 +++++++++++++---------
>  2 files changed, 15 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 8e5bd50d29fe..6a1b26aaf97e 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -2075,6 +2075,8 @@ xfs_buf_delwri_cancel(
>  		xfs_buf_lock(bp);
>  		bp->b_flags &= ~_XBF_DELWRI_Q;
>  		xfs_buf_list_del(bp);
> +		if (bp->b_log_item)
> +			xfs_buf_item_relse(bp);
>  		xfs_buf_relse(bp);

I still don't think this is safe.  The buffer log item might still be
tracked in the AIL when the delwri list is cancelled, so the delwri
list cancelling cannot release the BLI without removing the item
from the AIL, too. The delwri cancelling walk really shouldn't be
screwing with AIL state, which means it can't touch the BLIs here.

At minimum, it's a landmine for future users of
xfs_buf_delwri_cancel().  A quick look at the quotacheck code
indicates that it can cancel delwri lists that have BLIs in the AIL
(for newly allocated dquot chunks), so I think this is a real concern.

This is one of the reasons for submitting the delwri list on error;
the IO completion code does all the correct cleanup of log items
including removing them from the AIL because the buffer is now
either clean or stale and no longer needs to be tracked by the AIL.

If the filesystem has been shut down, then delwri list submission
will error out all buffers on the list via IO submission/completion
and do all the correct cleanup automatically.

I note that write IO errors during log recovery will cause immediate
shutdown of the filesystem via xfs_buf_ioend_handle_error():

	/*
         * We're not going to bother about retrying this during recovery.
         * One strike!
         */
        if (bp->b_flags & _XBF_LOGRECOVERY) {
                xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
                return false;
        }

So I'm guessing that the IO error injection error that caused this
failure was on a buffer read part way through recovering items.

Can you confirm that the failure is only seen after read IO error
injection and that write IO error injection causes immediate
shutdown and so avoids the problem altogether?

If so, then all we need to do to handle instantiation side errors (EIO, ENOMEM,
etc) is this:

	/*
	 * Submit buffers that have been dirtied by the last record recovered.
	 */
	if (!list_empty(&buffer_list)) {
		if (error) {
			/*
			 * If there has been an item recovery error then we
			 * cannot allow partial checkpoint writeback to
			 * occur.  We might have multiple checkpoints with the
			 * same start LSN in this buffer list, and partial
			 * writeback of a checkpoint in this situation can
			 * prevent future recovery of all the changes in the
			 * checkpoints at this start LSN.
			 *
			 * Note: Shutting down the filesystem will result in the
			 * delwri submission marking all the buffers stale,
			 * completing them and cleaning up _XBF_LOGRECOVERY
			 * state without doing any IO.
			 */
			xlog_force_shutdown(log, SHUTDOWN_LOG_IO_ERROR);
		}
		error2 = xfs_buf_delwri_submit(&buffer_list);
	}

Cheers,

Dave.
Brian Foster Jan. 11, 2024, 2:54 p.m. UTC | #12
On Thu, Jan 11, 2024 at 10:08:51AM +1100, Dave Chinner wrote:
> On Wed, Jan 10, 2024 at 03:03:24PM +0800, Long Li wrote:
> > On Wed, Jan 10, 2024 at 08:43:31AM +1100, Dave Chinner wrote:
> > > The issue is that the code as it stands doesn't handle object
> > > recovery from multiple checkpoints with the same start lsn. The
> > > easiest way to understand this is to look at the buffer submit logic
> > > on completion of a checkpoint:
> > > 
> > > 	if (log->l_recovery_lsn != trans->r_lsn &&
> > >             ohead->oh_flags & XLOG_COMMIT_TRANS) {
> > >                 error = xfs_buf_delwri_submit(buffer_list);
> > >                 if (error)
> > >                         return error;
> > >                 log->l_recovery_lsn = trans->r_lsn;
> > >         }
> > > 
> > > This submits the buffer list on the first checkpoint that completes
> > > with a new start LSN, not when all the checkpoints with the same
> > > start LSN complete. i.e.:
> > > 
> > > checkpoint  start LSN	commit lsn	submission on commit record
> > > A		32	  63		buffer list for A
> > > B		64	  68		buffer list for B
> > > C		64	  92		nothing, start lsn unchanged
> > > D		64	 127		nothing, start lsn unchanged
> > > E		128	 192		buffer list for C, D and E
> > > 
> > 
> > I have a different understanding of this code. In the first checkpoint's
> > handling of the commit record, buffer_list is empty and l_recovery_lsn is
> > updated to the first checkpoint's lsn, so the result is that each
> > checkpoint's submit logic tries to submit the buffers which were added to
> > the buffer list during checkpoint recovery of the previous LSN.
> > 
> >   xlog_do_recovery_pass
> >     LIST_HEAD (buffer_list);
> >     xlog_recover_process
> >       xlog_recover_process_data
> >         xlog_recover_process_ophdr
> >           xlog_recovery_process_trans
> >             if (log->l_recovery_lsn != trans->r_lsn &&
> >                 ohead->oh_flags & XLOG_COMMIT_TRANS) { 
> >               xfs_buf_delwri_submit(buffer_list); //submit buffer list
> >               log->l_recovery_lsn = trans->r_lsn;
> >             }
> >             xlog_recovery_process_trans
> >               xlog_recover_commit_trans
> >                 xlog_recover_items_pass2
> >                   item->ri_ops->commit_pass2
> >                     xlog_recover_buf_commit_pass2
> >                       xfs_buf_delwri_queue(bp, buffer_list) //add bp to buffer list
> >     if (!list_empty(&buffer_list)) 
> >       /* submit buffers that was added in checkpoint recovery of last LSN */
> >       xfs_buf_delwri_submit(&buffer_list)
> > 
> > So, I think it should be:
> >     
> > checkpoint  start LSN	commit lsn	submission on commit record
> > A		32	  63		nothing, buffer list is empty
> > B		64	  68		buffer list for A
> > C		64	  92		nothing, start lsn unchanged
> > D		64	 127		nothing, start lsn unchanged
> > E		128	 192		buffer list for B, C and D

Yeah, I managed to look through some of this code yesterday but hadn't
got back to this. I suspect the disconnect here is that the buffer list
isn't populated until we process the commit record for the associated
log transaction, and that occurs after the check to submit the buffer
list.

So when the commit record of A is seen, the recovery LSN is updated to
32 and the buffer list is still empty, and then is subsequently
populated based on the content of the transaction. From there, not until
we see a commit record for a transaction with a start LSN != 32 is the
buffer list submitted. This is how recovery knows there are no more
metadata lsn == 32 items, and the buffer list may include any number of
checkpoints that start at the current recovery lsn.

In general, this relies on proper commit record ordering which AFAIA
remains a pretty fundamental rule of XFS logging. It seems the confusion
has been resolved in any event, but something that comes to mind as
possibly beneficial to future readers is to perhaps consider turning
this submit check into a small helper with a proper name and any tweaks
that might help clarify the existing comment.

I think Dave's followup comments to invoke shutdown rather than play
games with cancellation make sense, so I'm particularly wondering if we
could just create an xlog_recovery_process_buffers() or some such helper
that could be reused in both places by conditionally doing the
shutdown/submit on error, and then documents everything nicely in one
place. Just a thought, however. It may not be worth it depending on how
ugly it looks.
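
For illustration only, a rough (entirely hypothetical, untested) sketch of
what I have in mind is below, assuming the shutdown-on-error behaviour Dave
suggested is the direction taken. The name is made up; the LSN boundary
check would stay at the commit record call site, and the end-of-pass path
would pass in the accumulated error:

	/*
	 * Hypothetical helper: never allow partial checkpoint writeback.  If
	 * item recovery failed part way through a start LSN, writing out what
	 * we have would violate the submit-on-LSN-boundary rule, so shut the
	 * log down first and let delwri submission clean the buffers up
	 * without doing any IO.
	 */
	static int
	xlog_recover_submit_buffers(
		struct xlog		*log,
		struct list_head	*buffer_list,
		int			error)
	{
		int			error2;

		if (list_empty(buffer_list))
			return error;

		if (error)
			xlog_force_shutdown(log, SHUTDOWN_LOG_IO_ERROR);

		error2 = xfs_buf_delwri_submit(buffer_list);
		return error ? error : error2;
	}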

Brian

Long Li Jan. 12, 2024, 12:55 p.m. UTC | #13
On Thu, Jan 11, 2024 at 11:47:47AM +1100, Dave Chinner wrote:
> On Thu, Dec 28, 2023 at 08:46:46PM +0800, Long Li wrote:
> > [...]
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index 8e5bd50d29fe..6a1b26aaf97e 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -2075,6 +2075,8 @@ xfs_buf_delwri_cancel(
> >  		xfs_buf_lock(bp);
> >  		bp->b_flags &= ~_XBF_DELWRI_Q;
> >  		xfs_buf_list_del(bp);
> > +		if (bp->b_log_item)
> > +			xfs_buf_item_relse(bp);
> >  		xfs_buf_relse(bp);
> 
> I still don't think this is safe.  The buffer log item might still be
> tracked in the AIL when the delwri list is cancelled, so the delwri
> list cancelling cannot release the BLI without removing the item
> from the AIL, too. The delwri cancelling walk really shouldn't be
> screwing with AIL state, which means it can't touch the BLIs here.
> 
> At minimum, it's a landmine for future users of
> xfs_buf_delwri_cancel().  A quick look at the quotacheck code
> indicates that it can cancel delwri lists that have BLIs in the AIL
> (for newly allocated dquot chunks), so I think this is a real concern.
> 
> This is one of the reasons for submitting the delwri list on error;
> the IO completion code does all the correct cleanup of log items
> including removing them from the AIL because the buffer is now
> either clean or stale and no longer needs to be tracked by the AIL.

Yes, it's not a safe solution.

> 
> If the filesystem has been shut down, then delwri list submission
> will error out all buffers on the list via IO submission/completion
> and do all the correct cleanup automatically.
> 
> I note that write IO errors during log recovery will cause immediate
> shutdown of the filesystem via xfs_buf_ioend_handle_error():
> 
> 	/*
>          * We're not going to bother about retrying this during recovery.
>          * One strike!
>          */
>         if (bp->b_flags & _XBF_LOGRECOVERY) {
>                 xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
>                 return false;
>         }
> 
> So I'm guessing that the IO error injection error that caused this
> failure was on a buffer read part way through recovering items.
> 
> Can you confirm that the failure is only seen after read IO error
> injection and that write IO error injection causes immediate
> shutdown and so avoids the problem altogether?

This problem is very hard to reproduce; we have reproduced it only three
times. There may be several mounts between writing buffers that are not on
LSN boundaries and reporting the free space btree corruption, and I can't
tell in which mount during the test the violation happened. So judging by
the messages I've reproduced, I can't confirm that the failure is only seen
after read IO error injection. Looking at one of the kernel logs I've
reproduced, there are several mount failures before the free space btree
corruption is reported, and the causes of those mount failures include both
read IO errors and write IO errors.

[51555.801349] XFS (dm-3): Mounting V5 Filesystem
[51555.982130] XFS (dm-3): Starting recovery (logdev: internal)
[51558.153638] FAULT_INJECTION: forcing a failure.
               name fail_make_request, interval 20, probability 1, space 0, times -1
[51558.153723] XFS (dm-3): log recovery read I/O error at daddr 0x3972 len 1 error -5
[51558.165996] XFS (dm-3): log mount/recovery failed: error -5
[51558.166880] XFS (dm-3): log mount failed
[51558.410963] XFS (dm-3): EXPERIMENTAL big timestamp feature in use. Use at your own risk!
[51558.410981] XFS (dm-3): EXPERIMENTAL inode btree counters feature in use. Use at your own risk!
[51558.413074] XFS (dm-3): Mounting V5 Filesystem
[51558.595739] XFS (dm-3): Starting recovery (logdev: internal)
[51559.592552] FAULT_INJECTION: forcing a failure.
               name fail_make_request, interval 20, probability 1, space 0, times -1
[51559.593008] XFS (dm-3): metadata I/O error in "xfs_buf_ioend_handle_error+0x170/0x760 [xfs]" at daddr 0x1879e0 len 32 error 5
[51559.593335] XFS (dm-3): Metadata I/O Error (0x1) detected at xfs_buf_ioend_handle_error+0x63c/0x760 [xfs] (fs/xfs/xfs_buf.c:1272).  Shutting down filesystem.
[51559.593346] XFS (dm-3): Please unmount the filesystem and rectify the problem(s)
[51559.602833] XFS (dm-3): log mount/recovery failed: error -5
[51559.603772] XFS (dm-3): log mount failed
[51559.835690] XFS (dm-3): EXPERIMENTAL big timestamp feature in use. Use at your own risk!
[51559.835708] XFS (dm-3): EXPERIMENTAL inode btree counters feature in use. Use at your own risk!
[51559.837829] XFS (dm-3): Mounting V5 Filesystem
[51560.024083] XFS (dm-3): Starting recovery (logdev: internal)
[51562.155545] FAULT_INJECTION: forcing a failure.
               name fail_make_request, interval 20, probability 1, space 0, times -1
[51562.445074] XFS (dm-3): Ending recovery (logdev: internal)
[51563.553960] XFS (dm-3): Internal error ltbno + ltlen > bno at line 1976 of file fs/xfs/libxfs/xfs_alloc.c.  Caller xfs_free_ag_extent+0x558/0xd80 [xfs]
[51563.558629] XFS (dm-3): Corruption detected. Unmount and run xfs_repair

> 
> If so, then all we need to do to handle instantiation side errors (EIO, ENOMEM,
> etc) is this:
> 
> 	/*
> 	 * Submit buffers that have been dirtied by the last record recovered.
> 	 */
> 	if (!list_empty(&buffer_list)) {
> 		if (error) {
> 			/*
> 			 * If there has been an item recovery error then we
> 			 * cannot allow partial checkpoint writeback to
> 			 * occur.  We might have multiple checkpoints with the
> 			 * same start LSN in this buffer list, and partial
> 			 * writeback of a checkpoint in this situation can
> 			 * prevent future recovery of all the changes in the
> 			 * checkpoints at this start LSN.
> 			 *
> 			 * Note: Shutting down the filesystem will result in the
> 			 * delwri submission marking all the buffers stale,
> 			 * completing them and cleaning up _XBF_LOGRECOVERY
> 			 * state without doing any IO.
> 			 */
> 			xlog_force_shutdown(log, SHUTDOWN_LOG_IO_ERROR);
> 		}
> 		error2 = xfs_buf_delwri_submit(&buffer_list);
> 	}
>

This solution is also used in our internally maintained linux branch,
and after several months of testing, the problem no longer arises. It
seems safe and reasonable enough.

Thanks,
Long Li
Brian Foster Jan. 12, 2024, 6:42 p.m. UTC | #14
On Fri, Jan 12, 2024 at 08:55:47PM +0800, Long Li wrote:
> On Thu, Jan 11, 2024 at 11:47:47AM +1100, Dave Chinner wrote:
> > [...]
> > 
> > I still don't think this is safe.  The buffer log item might still be
> > tracked in the AIL when the delwri list is cancelled, so the delwri
> > list cancelling cannot release the BLI without removing the item
> > from the AIL, too. The delwri cancelling walk really shouldn't be
> > screwing with AIL state, which means it can't touch the BLIs here.
> > 
> > At minimum, it's a landmine for future users of
> > xfs_buf_delwri_cancel().  A quick look at the quotacheck code
> > indicates that it can cancel delwri lists that have BLIs in the AIL
> > (for newly allocated dquot chunks), so I think this is a real concern.
> > 
> > This is one of the reasons for submitting the delwri list on error;
> > the IO completion code does all the correct cleanup of log items
> > including removing them from the AIL because the buffer is now
> > either clean or stale and no longer needs to be tracked by the AIL.
> 
> Yes, it's not a safe solution.
> 
> > 
> > If the filesystem has been shut down, then delwri list submission
> > will error out all buffers on the list via IO submission/completion
> > and do all the correct cleanup automatically.
> > 
> > I note that write IO errors during log recovery will cause immediate
> > shutdown of the filesystem via xfs_buf_ioend_handle_error():
> > 
> > 	/*
> >          * We're not going to bother about retrying this during recovery.
> >          * One strike!
> >          */
> >         if (bp->b_flags & _XBF_LOGRECOVERY) {
> >                 xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
> >                 return false;
> >         }
> > 
> > So I'm guessing that the IO error injection error that caused this
> > failure was on a buffer read part way through recovering items.
> > 
> > Can you confirm that the failure is only seen after read IO error
> > injection and that write IO error injection causes immediate
> > shutdown and so avoids the problem altogether?
> 
> This problem is very hard to reproduce; we have reproduced it only three
> times. There may be several mounts between writing buffers that are not on
> LSN boundaries and reporting the free space btree corruption, and I can't
> tell in which mount during the test the violation happened. So judging by
> the messages I've reproduced, I can't confirm that the failure is only seen
> after read IO error injection. Looking at one of the kernel logs I've
> reproduced, there are several mount failures before the free space btree
> corruption is reported, and the causes of those mount failures include both
> read IO errors and write IO errors.
> 
> [...]
> 
> > 
> > If so, then all we need to do to handle instantiation side errors (EIO, ENOMEM,
> > etc) is this:
> > 
> > 	/*
> > 	 * Submit buffers that have been dirtied by the last record recovered.
> > 	 */
> > 	if (!list_empty(&buffer_list)) {
> > 		if (error) {
> > 			/*
> > 			 * If there has been an item recovery error then we
> > 			 * cannot allow partial checkpoint writeback to
> > 			 * occur.  We might have multiple checkpoints with the
> > 			 * same start LSN in this buffer list, and partial
> > 			 * writeback of a checkpoint in this situation can
> > 			 * prevent future recovery of all the changes in the
> > 			 * checkpoints at this start LSN.
> > 			 *
> > 			 * Note: Shutting down the filesystem will result in the
> > 			 * delwri submission marking all the buffers stale,
> > 			 * completing them and cleaning up _XBF_LOGRECOVERY
> > 			 * state without doing any IO.
> > 			 */
> > 			xlog_force_shutdown(log, SHUTDOWN_LOG_IO_ERROR);
> > 		}
> > 		error2 = xfs_buf_delwri_submit(&buffer_list);
> > 	}
> >
> 
> This solution is also used in our internally maintained linux branch,
> and after several months of testing, the problem no longer arises. It
> seems safe and reasonable enough.
> 

I assume you're referring to the xfs_buf_delwri_cancel() change..? If
so, this is a valid data point but doesn't necessarily help explain
whether the change is correct in any other context. I/O cancel probably
doesn't happen often for one, and even if it does, it's not totally
clear if you're reproducing a situation where the item might be AIL
resident or not at the time (or it is and you have a use after free that
goes undetected). And even if none of that is relevant, that still
doesn't protect against future code changes if this code doesn't respect
the established bli lifecycle rules.

IIRC, the bli_refcount (sampled via xfs_buf_item_relse()) is not
necessarily an object lifecycle refcount. It simply reflects whether the
item exists in a transaction where it might eventually be dirtied. This
is somewhat tricky, but can also be surmised from some of the logic in
xfs_buf_item_put(), for example.

IOW, I think in the simple (non-recovery) case of a buffer being read,
modified and committed by a transaction, the bli would eventually end up
in a state where bli_refcount == 0 but is still resident in the AIL
until eventually written back by xfsaild. That metadata writeback
completion is what eventually frees the bli via xfs_buf_item_done().

So if I'm not mistaken wrt to the above example sequence, the
interesting question is if we suppose a buffer is in that intermediate
state of waiting for writeback, and then somebody were to hypothetically
execute a bit of code that simply added the associated buffer to a
delwri q and immediately cancelled it, what would happen with this
change in place? ISTM the remove would prematurely free the buffer/bli
while it's still resident in the AIL and pending writeback, thus
resulting in use-after-free or potential memory/list corruption, etc. Is
that not the case?
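
To make that hypothetical concrete, the sort of (made up) caller I'm
imagining is sketched below; bp is assumed to be a locked, referenced
buffer whose bli is clean of transactions (bli_refcount == 0) but still
AIL resident awaiting writeback:

	LIST_HEAD(tmp_list);

	/* caller holds a reference to bp and has it locked */
	xfs_buf_delwri_queue(bp, &tmp_list);
	xfs_buf_unlock(bp);

	/*
	 * With the proposed xfs_buf_delwri_cancel() hunk applied, the cancel
	 * below would call xfs_buf_item_relse() and free the bli while it is
	 * still linked into the AIL, so the next AIL push or removal would
	 * land on freed memory.
	 */
	xfs_buf_delwri_cancel(&tmp_list);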

Brian
Long Li Jan. 15, 2024, 1:31 p.m. UTC | #15
On Fri, Jan 12, 2024 at 01:42:32PM -0500, Brian Foster wrote:
> On Fri, Jan 12, 2024 at 08:55:47PM +0800, Long Li wrote:
> > On Thu, Jan 11, 2024 at 11:47:47AM +1100, Dave Chinner wrote:
> > > [...]
> > > 
> > > If so, then all we need to do to handle instantiation side errors (EIO, ENOMEM,
> > > etc) is this:
> > > 
> > > 	/*
> > > 	 * Submit buffers that have been dirtied by the last record recovered.
> > > 	 */
> > > 	if (!list_empty(&buffer_list)) {
> > > 		if (error) {
> > > 			/*
> > > 			 * If there has been an item recovery error then we
> > > 			 * cannot allow partial checkpoint writeback to
> > > 			 * occur.  We might have multiple checkpoints with the
> > > 			 * same start LSN in this buffer list, and partial
> > > 			 * writeback of a checkpoint in this situation can
> > > 			 * prevent future recovery of all the changes in the
> > > 			 * checkpoints at this start LSN.
> > > 			 *
> > > 			 * Note: Shutting down the filesystem will result in the
> > > 			 * delwri submission marking all the buffers stale,
> > > 			 * completing them and cleaning up _XBF_LOGRECOVERY
> > > 			 * state without doing any IO.
> > > 			 */
> > > 			xlog_force_shutdown(log, SHUTDOWN_LOG_IO_ERROR);
> > > 		}
> > > 		error2 = xfs_buf_delwri_submit(&buffer_list);
> > > 	}
> > >
> > 
> > This solution is also used in our internally maintained linux branch,
> > and after several months of testing, the problem no longer arises. It
> > seems safe and reasonable enough.
> > 
> 
> I assume you're referring to the xfs_buf_delwri_cancel() change..? If

I'm not referring to the xfs_buf_delwri_cancel() solution, but to the
use of the xlog_force_shutdown() solution. We believe the shutdown
solution is less risky because the buffer list can be cleaned up
automatically via IO submission/completion and it does not change any
other logic, so we've used it in our internally maintained Linux branch.

On the other hand, I thought it would be better to proactively cancel
the buffer list, so I sent that version out, but I overlooked the
potential issues...

> so, this is a valid data point but doesn't necessarily help explain
> whether the change is correct in any other context. I/O cancel probably
> doesn't happen often for one, and even if it does, it's not totally
> clear if you're reproducing a situation where the item might be AIL
> resident or not at the time (or it is and you have a use after free that
> goes undetected). And even if none of that is relevant, that still
> doesn't protect against future code changes if this code doesn't respect
> the established bli lifecycle rules.

Yes, I agree with you.

> 
> IIRC, the bli_refcount (sampled via xfs_buf_item_relse()) is not
> necessarily an object lifecycle refcount. It simply reflects whether the
> item exists in a transaction where it might eventually be dirtied. This
> is somewhat tricky, but can also be surmised from some of the logic in
> xfs_buf_item_put(), for example.
> 
> IOW, I think in the simple (non-recovery) case of a buffer being read,
> modified and committed by a transaction, the bli would eventually end up
> in a state where bli_refcount == 0 but is still resident in the AIL
> until eventually written back by xfsaild. That metadata writeback
> completion is what eventually frees the bli via xfs_buf_item_done().
> 
> So if I'm not mistaken wrt to the above example sequence, the
> interesting question is if we suppose a buffer is in that intermediate
> state of waiting for writeback, and then somebody were to hypothetically
> execute a bit of code that simply added the associated buffer to a
> delwri q and immediately cancelled it, what would happen with this
> change in place? ISTM the remove would prematurely free the buffer/bli
> while it's still resident in the AIL and pending writeback, thus
> resulting in use-after-free or potential memory/list corruption, etc. Is
> that not the case?

Yes, if the buf item is still in the AIL, it is obviously not right to
release it.

Thanks,
Long Li
Brian Foster Jan. 15, 2024, 2:14 p.m. UTC | #16
On Mon, Jan 15, 2024 at 09:31:03PM +0800, Long Li wrote:
> On Fri, Jan 12, 2024 at 01:42:32PM -0500, Brian Foster wrote:
> > [...]
> > 
> > I assume you're referring to the xfs_buf_delwri_cancel() change..? If
> 
> I'm not referring to the xfs_buf_delwri_cancel() solution, but to the
> use of the xlog_force_shutdown() solution. We believe the shutdown
> solution is less risky because the buffer list can be cleaned up
> automatically via IO submission/completion and it does not change any
> other logic, so we've used it in our internally maintained Linux branch.
> 
> On the other hand, I thought it would be better to proactively cancel
> the buffer list, so I sent that version out, but I overlooked the
> potential issues...
> 

Ah, Ok. Sorry for the confusion. Thanks.

Brian

diff mbox series

Patch

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 8e5bd50d29fe..6a1b26aaf97e 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -2075,6 +2075,8 @@  xfs_buf_delwri_cancel(
 		xfs_buf_lock(bp);
 		bp->b_flags &= ~_XBF_DELWRI_Q;
 		xfs_buf_list_del(bp);
+		if (bp->b_log_item)
+			xfs_buf_item_relse(bp);
 		xfs_buf_relse(bp);
 	}
 }
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 1251c81e55f9..2cda6c90890d 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2964,7 +2964,6 @@  xlog_do_recovery_pass(
 	char			*offset;
 	char			*hbp, *dbp;
 	int			error = 0, h_size, h_len;
-	int			error2 = 0;
 	int			bblks, split_bblks;
 	int			hblks, split_hblks, wrapped_hblks;
 	int			i;
@@ -3203,16 +3202,21 @@  xlog_do_recovery_pass(
  bread_err1:
 	kmem_free(hbp);
 
-	/*
-	 * Submit buffers that have been added from the last record processed,
-	 * regardless of error status.
-	 */
-	if (!list_empty(&buffer_list))
-		error2 = xfs_buf_delwri_submit(&buffer_list);
-
 	if (error && first_bad)
 		*first_bad = rhead_blk;
 
+	/*
+	 * If there was no error, submit the buffers that have been added from
+	 * the last record processed; otherwise cancel the write list, to ensure
+	 * buffers are submitted on LSN boundaries.
+	 */
+	if (!list_empty(&buffer_list)) {
+		if (error)
+			xfs_buf_delwri_cancel(&buffer_list);
+		else
+			error = xfs_buf_delwri_submit(&buffer_list);
+	}
+
 	/*
 	 * Transactions are freed at commit time but transactions without commit
 	 * records on disk are never committed. Free any that may be left in the
@@ -3226,7 +3230,7 @@  xlog_do_recovery_pass(
 			xlog_recover_free_trans(trans);
 	}
 
-	return error ? error : error2;
+	return error;
 }
 
 /*