diff mbox series

[2/3] xfs: failed delalloc conversion results in bad extent lists

Message ID 20230214055114.4141947-3-david@fromorbit.com (mailing list archive)
State New, archived
Headers show
Series xfs, iomap: fix writeback failure handling | expand

Commit Message

Dave Chinner Feb. 14, 2023, 5:51 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

If we fail delayed allocation conversion because we cannot allocate
blocks, we end up in the situation where the inode extent list is
effectively corrupt and unresolvable. Whilst we have dirty data in
memory that we cannot allocate space for, we cannot write that data
back to disk. Unmounting a filesystem in this state results in data
loss.

In situations where we end up with a corrupt extent list in memory,
we can also be asked to convert a delayed region that does not have
a delalloc extent backing it. This should be considered a
corruption, too, not a "try again later" error.

These conversion failures result in the inode being sick and needing
repair, but we don't mark all the conditions that can lead to bmap
sickness being marked. Make sure that the error conditions that
indicate corruption are properly marked.

Further, if we trip over these corruptions conditions, we then have
to reclaim an inode that has unresolvable delayed allocation extents
attached to the inode. This generally happens at unmount and inode
inactivation will fire assert failures because we've left stray
delayed allocation extents behind on the inode. Hence we need to
ensure that we only trigger the stale delalloc extent checks if the
inode is fully healthy.

Even then, this may not be enough, because the inactivation code
assumes that there will be no stray delayed extents unless the
filesystem is shut down. If the inode is unhealthy, we need to
ensure we clean up delayed allocation extents within EOF because
the VFS has already tossed the data away. Hence there's no longer
any data over the delalloc extents to write back, so we need to also
toss the delayed allocation extents to ensure we release the space
reservation the delalloc extent holds. Failure to punch delalloc
extents in this case results in assert failures during unmount when
the delalloc block counter is torn down.

This all needs to be in place before the next patch which resolves a
bug in the iomap code that discards delalloc extents backing dirty
pages on writeback error without discarding the dirty data. Hence we
need to be able to handle delalloc extents in inode cleanup sanely,
rather than rely on incorrectly punching the delalloc extents on the
first writeback error that occurs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 13 ++++++++++---
 fs/xfs/xfs_icache.c      |  4 +++-
 fs/xfs/xfs_inode.c       | 10 ++++++++++
 3 files changed, 23 insertions(+), 4 deletions(-)

Comments

Christoph Hellwig Feb. 14, 2023, 8:13 a.m. UTC | #1
On Tue, Feb 14, 2023 at 04:51:13PM +1100, Dave Chinner wrote:
> index 958e4bb2e51e..fb718a5825d5 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -4553,8 +4553,12 @@ xfs_bmapi_convert_delalloc(
>  		 * should only happen for the COW fork, where another thread
>  		 * might have moved the extent to the data fork in the meantime.
>  		 */
> -		WARN_ON_ONCE(whichfork != XFS_COW_FORK);
> -		error = -EAGAIN;
> +		if (whichfork != XFS_COW_FORK) {
> +			xfs_bmap_mark_sick(ip, whichfork);
> +			error = -EFSCORRUPTED;
> +		} else {
> +			error = -EAGAIN;
> +		}

The comment above should probably be expanded a bit on what this means
for a non-cow fork extent and how we'll handle it later.

> +	if (error) {
> +		if ((error == -EFSCORRUPTED) || (error == -EFSBADCRC))

Nit: no need for the inner braces.

>  
> +		/*
> +		 * If the inode is sick, then it might have delalloc extents
> +		 * within EOF that we were unable to convert. We have to punch
> +		 * them out here to release the reservation as there is no
> +		 * longer any data to write back into the delalloc range now.
> +		 */
> +		if (!xfs_inode_is_healthy(ip))
> +			xfs_bmap_punch_delalloc_range(ip, 0,
> +						i_size_read(VFS_I(ip)));

Is i_size_read the right check here?  The delalloc extent could extend
past i_size if i_size is not block aligned.  Can't we just simply pass
(xfs_off_t)-1 here?
Dave Chinner Feb. 14, 2023, 10:26 p.m. UTC | #2
On Tue, Feb 14, 2023 at 12:13:19AM -0800, Christoph Hellwig wrote:
> On Tue, Feb 14, 2023 at 04:51:13PM +1100, Dave Chinner wrote:
> > index 958e4bb2e51e..fb718a5825d5 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -4553,8 +4553,12 @@ xfs_bmapi_convert_delalloc(
> >  		 * should only happen for the COW fork, where another thread
> >  		 * might have moved the extent to the data fork in the meantime.
> >  		 */
> > -		WARN_ON_ONCE(whichfork != XFS_COW_FORK);
> > -		error = -EAGAIN;
> > +		if (whichfork != XFS_COW_FORK) {
> > +			xfs_bmap_mark_sick(ip, whichfork);
> > +			error = -EFSCORRUPTED;
> > +		} else {
> > +			error = -EAGAIN;
> > +		}
> 
> The comment above should probably be expanded a bit on what this means
> for a non-cow fork extent and how we'll handle it later.
> 
> > +	if (error) {
> > +		if ((error == -EFSCORRUPTED) || (error == -EFSBADCRC))
> 
> Nit: no need for the inner braces.
> 
> >  
> > +		/*
> > +		 * If the inode is sick, then it might have delalloc extents
> > +		 * within EOF that we were unable to convert. We have to punch
> > +		 * them out here to release the reservation as there is no
> > +		 * longer any data to write back into the delalloc range now.
> > +		 */
> > +		if (!xfs_inode_is_healthy(ip))
> > +			xfs_bmap_punch_delalloc_range(ip, 0,
> > +						i_size_read(VFS_I(ip)));
> 
> Is i_size_read the right check here?  The delalloc extent could extend
> past i_size if i_size is not block aligned.  Can't we just simply pass
> (xfs_off_t)-1 here?

Probably, we just killed all the delalloc blocks beyond eof via
xfs_free_eofblocks() in the line above this, so I didn't seem
necessary to try to punch blocks beyond EOF for this case. Easy
enough to do to be safe, just need a comment update to go with
it....

Cheers,

Dave.
> 
>
diff mbox series

Patch

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 958e4bb2e51e..fb718a5825d5 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4553,8 +4553,12 @@  xfs_bmapi_convert_delalloc(
 		 * should only happen for the COW fork, where another thread
 		 * might have moved the extent to the data fork in the meantime.
 		 */
-		WARN_ON_ONCE(whichfork != XFS_COW_FORK);
-		error = -EAGAIN;
+		if (whichfork != XFS_COW_FORK) {
+			xfs_bmap_mark_sick(ip, whichfork);
+			error = -EFSCORRUPTED;
+		} else {
+			error = -EAGAIN;
+		}
 		goto out_trans_cancel;
 	}
 
@@ -4598,8 +4602,11 @@  xfs_bmapi_convert_delalloc(
 		bma.prev.br_startoff = NULLFILEOFF;
 
 	error = xfs_bmapi_allocate(&bma);
-	if (error)
+	if (error) {
+		if ((error == -EFSCORRUPTED) || (error == -EFSBADCRC))
+			xfs_bmap_mark_sick(ip, whichfork);
 		goto out_finish;
+	}
 
 	error = -ENOSPC;
 	if (WARN_ON_ONCE(bma.blkno == NULLFSBLOCK))
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index ddeaccc04aec..4354b6639dec 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -24,6 +24,7 @@ 
 #include "xfs_ialloc.h"
 #include "xfs_ag.h"
 #include "xfs_log_priv.h"
+#include "xfs_health.h"
 
 #include <linux/iversion.h>
 
@@ -1810,7 +1811,8 @@  xfs_inodegc_set_reclaimable(
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_perag	*pag;
 
-	if (!xfs_is_shutdown(mp) && ip->i_delayed_blks) {
+	if (ip->i_delayed_blks && xfs_inode_is_healthy(ip) &&
+	    !xfs_is_shutdown(mp)) {
 		xfs_check_delalloc(ip, XFS_DATA_FORK);
 		xfs_check_delalloc(ip, XFS_COW_FORK);
 		ASSERT(0);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index d354ea2b74f9..06f1229ef628 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -37,6 +37,7 @@ 
 #include "xfs_reflink.h"
 #include "xfs_ag.h"
 #include "xfs_log_priv.h"
+#include "xfs_health.h"
 
 struct kmem_cache *xfs_inode_cache;
 
@@ -1738,6 +1739,15 @@  xfs_inactive(
 		if (xfs_can_free_eofblocks(ip, true))
 			xfs_free_eofblocks(ip);
 
+		/*
+		 * If the inode is sick, then it might have delalloc extents
+		 * within EOF that we were unable to convert. We have to punch
+		 * them out here to release the reservation as there is no
+		 * longer any data to write back into the delalloc range now.
+		 */
+		if (!xfs_inode_is_healthy(ip))
+			xfs_bmap_punch_delalloc_range(ip, 0,
+						i_size_read(VFS_I(ip)));
 		goto out;
 	}