[2/2] xfs: Prevent deadlock when allocating blocks for AGFL

Message ID	20210428065152.77280-2-chandanrlinux@gmail.com (mailing list archive)
State	Deferred, archived
Headers	Return-Path: <linux-xfs-owner@kernel.org>
	From: Chandan Babu R <chandanrlinux@gmail.com>
	To: linux-xfs@vger.kernel.org
	Cc: Chandan Babu R <chandanrlinux@gmail.com>
	Subject: [PATCH 2/2] xfs: Prevent deadlock when allocating blocks for AGFL
	Date: Wed, 28 Apr 2021 12:21:52 +0530
	Message-Id: <20210428065152.77280-2-chandanrlinux@gmail.com>
	In-Reply-To: <20210428065152.77280-1-chandanrlinux@gmail.com>
	References: <20210428065152.77280-1-chandanrlinux@gmail.com>
	MIME-Version: 1.0
	Content-Transfer-Encoding: 8bit
	Precedence: bulk
Series	[1/2] xfs: Introduce XFS_EXTENT_BUSY_IN_TRANS busy extent flag
	[2/2] xfs: Prevent deadlock when allocating blocks for AGFL

Chandan Babu R April 28, 2021, 6:51 a.m. UTC

Executing xfs/538 after disabling injection of the bmap_alloc_minlen_extent
error can cause several tasks to trigger the hung task timeout. Most of the
tasks are blocked on getting a lock on an AG's AGF buffer. However, the task
which holds the lock on the AG's AGF buffer has the following call trace:

PID: 1341   TASK: ffff8881073f3700  CPU: 1   COMMAND: "fsstress"
   __schedule+0x22f at ffffffff81f75e8f
   schedule+0x46 at ffffffff81f76366
   xfs_extent_busy_flush+0x69 at ffffffff81477d99
   xfs_alloc_ag_vextent_size+0x16a at ffffffff8141711a
   xfs_alloc_ag_vextent+0x19b at ffffffff81417edb
   xfs_alloc_fix_freelist+0x22f at ffffffff8141896f
   xfs_free_extent_fix_freelist+0x6a at ffffffff8141939a
   __xfs_free_extent+0x99 at ffffffff81419499
   xfs_trans_free_extent+0x3e at ffffffff814a6fee
   xfs_extent_free_finish_item+0x24 at ffffffff814a70d4
   xfs_defer_finish_noroll+0x1f7 at ffffffff81441407
   xfs_defer_finish+0x11 at ffffffff814417e1
   xfs_itruncate_extents_flags+0x13d at ffffffff8148b7dd
   xfs_inactive_truncate+0xb9 at ffffffff8148bb89
   xfs_inactive+0x227 at ffffffff8148c4f7
   xfs_fs_destroy_inode+0xb8 at ffffffff81496898
   destroy_inode+0x3b at ffffffff8127d2ab
   do_unlinkat+0x1d1 at ffffffff81270df1
   do_syscall_64+0x40 at ffffffff81f6b5f0
   entry_SYSCALL_64_after_hwframe+0x44 at ffffffff8200007c

The following sequence of events leads to the above call trace:

1. The task frees at least two extents belonging to the file being truncated.
2. The corresponding xfs_extent_free_items are stored in the list pointed to
   by xfs_defer_pending->dfp_work.
3. When executing the next step of the rolling transaction, the first of the
   above mentioned extents is freed. The corresponding busy extent entry is
   added to the current transaction's tp->t_busy list as well as to the perag
   rb tree at xfs_perag->pagb_tree.
4. When trying to free the second extent, XFS determines that the AGFL needs
   to be populated and hence tries to allocate free blocks.
5. The only free extent whose size is >= xfs_alloc_arg->maxlen
   happens to be the first extent that was freed by the current transaction.
6. Hence xfs_alloc_ag_vextent_size() flushes the CIL in the hope of clearing
   the busy status of the extent and waits for the busy generation number to
   change.
7. However, flushing the CIL is futile since the busy extent is still in the
   current transaction's tp->t_busy list.

Here the task ends up waiting indefinitely.

This commit fixes the bug by preventing a CIL flush if all free extents are
busy and all of them are in the transaction's tp->t_busy list.

Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
---
 fs/xfs/libxfs/xfs_alloc.c | 59 +++++++++++++++++++++++++++++----------
 fs/xfs/xfs_extent_busy.c  |  6 +++-
 fs/xfs/xfs_extent_busy.h  |  2 +-
 3 files changed, 51 insertions(+), 16 deletions(-)
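
The numbered sequence in the commit message can be modelled in a few lines of
userspace C. This is only a toy sketch under simplifying assumptions, not kernel
code: struct toy_busy_extent, toy_cil_flush() and toy_wait_would_hang() are
hypothetical names, and the "busy generation" is reduced to a counter that only
advances when a flush actually clears a busy extent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a busy extent record. */
struct toy_busy_extent {
	unsigned int	start;		/* first block of the extent */
	unsigned int	len;		/* length in blocks */
	bool		busy;		/* still tracked as busy? */
	bool		in_cur_trans;	/* freed by the transaction now allocating */
};

/*
 * Toy model of a CIL flush: busy extents whose freeing transaction has
 * already committed become reusable, while extents freed by the transaction
 * that is still running stay busy. The returned generation differs from
 * busy_gen only if at least one extent was cleared.
 */
static unsigned int
toy_cil_flush(struct toy_busy_extent *ext, size_t nr, unsigned int busy_gen)
{
	bool	cleared = false;
	size_t	i;

	for (i = 0; i < nr; i++) {
		if (ext[i].busy && !ext[i].in_cur_trans) {
			ext[i].busy = false;
			cleared = true;
		}
	}
	return cleared ? busy_gen + 1 : busy_gen;
}

/*
 * The waiter sleeps until the busy generation differs from the value it
 * sampled. If the flush cannot change the generation, the wait never ends.
 */
static bool
toy_wait_would_hang(struct toy_busy_extent *ext, size_t nr)
{
	unsigned int	sampled_gen = 0;

	return toy_cil_flush(ext, nr, sampled_gen) == sampled_gen;
}
```

In this model, an array holding a single extent with in_cur_trans set makes
toy_wait_would_hang() return true, which corresponds to step 7 above. An extent
that was freed by an already committed transaction is cleared by the flush, so
the wait completes.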

Dave Chinner April 29, 2021, 1:12 a.m. UTC | #1

On Wed, Apr 28, 2021 at 12:21:52PM +0530, Chandan Babu R wrote:
> Executing xfs/538 after disabling injection of the bmap_alloc_minlen_extent
> error can cause several tasks to trigger the hung task timeout. Most of the
> tasks are blocked on getting a lock on an AG's AGF buffer. However, the task
> which holds the lock on the AG's AGF buffer has the following call trace:
> 
> PID: 1341   TASK: ffff8881073f3700  CPU: 1   COMMAND: "fsstress"
>    __schedule+0x22f at ffffffff81f75e8f
>    schedule+0x46 at ffffffff81f76366
>    xfs_extent_busy_flush+0x69 at ffffffff81477d99
>    xfs_alloc_ag_vextent_size+0x16a at ffffffff8141711a
>    xfs_alloc_ag_vextent+0x19b at ffffffff81417edb
>    xfs_alloc_fix_freelist+0x22f at ffffffff8141896f
>    xfs_free_extent_fix_freelist+0x6a at ffffffff8141939a
>    __xfs_free_extent+0x99 at ffffffff81419499
>    xfs_trans_free_extent+0x3e at ffffffff814a6fee
>    xfs_extent_free_finish_item+0x24 at ffffffff814a70d4
>    xfs_defer_finish_noroll+0x1f7 at ffffffff81441407
>    xfs_defer_finish+0x11 at ffffffff814417e1
>    xfs_itruncate_extents_flags+0x13d at ffffffff8148b7dd
>    xfs_inactive_truncate+0xb9 at ffffffff8148bb89
>    xfs_inactive+0x227 at ffffffff8148c4f7
>    xfs_fs_destroy_inode+0xb8 at ffffffff81496898
>    destroy_inode+0x3b at ffffffff8127d2ab
>    do_unlinkat+0x1d1 at ffffffff81270df1
>    do_syscall_64+0x40 at ffffffff81f6b5f0
>    entry_SYSCALL_64_after_hwframe+0x44 at ffffffff8200007c
> 
> The following sequence of events leads to the above call trace:
> 
> 1. The task frees at least two extents belonging to the file being truncated.
> 2. The corresponding xfs_extent_free_items are stored in the list pointed to
>    by xfs_defer_pending->dfp_work.
> 3. When executing the next step of the rolling transaction, the first of the
>    above mentioned extents is freed. The corresponding busy extent entry is
>    added to the current transaction's tp->t_busy list as well as to the perag
>    rb tree at xfs_perag->pagb_tree.
> 4. When trying to free the second extent, XFS determines that the AGFL needs
>    to be populated and hence tries to allocate free blocks.
> 5. The only free extent whose size is >= xfs_alloc_arg->maxlen
>    happens to be the first extent that was freed by the current transaction.
> 6. Hence xfs_alloc_ag_vextent_size() flushes the CIL in the hope of clearing
>    the busy status of the extent and waits for the busy generation number to
>    change.
> 7. However, flushing the CIL is futile since the busy extent is still in the
>    current transaction's tp->t_busy list.
> 
> Here the task ends up waiting indefinitely.
> 
> This commit fixes the bug by preventing a CIL flush if all free extents are
> busy and all of them are in the transaction's tp->t_busy list.

Hmmm. I don't doubt that this fixes the symptom you are seeing, but
the way it is being fixed doesn't seem right to me at all.

We're trying to populate the AGFL here, and the fact is that a
multi-block allocation is simply an optimisation to minimise the
number of extents we need to allocate to fill the AGFL. The extent
that gets allocated gets broken up into single blocks to be inserted
into the AGFL, so we don't actually need a contiguous extent to be
allocated here.

Hence, if the extent we find is busy when allocating for the AGFL,
we should just skip it and choose another extent. args->minlen is
set to zero for the allocation, so we can actually return any extent
that has a length <= args->maxlen. We know this is an AGFL
allocation because args->resv == XFS_AG_RESV_AGFL, so if we find a
busy extent that would require a log force to be able to use before
we can place it in the AGFL, we should just skip it entirely and
select another extent to allocate from.

Adding another two boolean conditionals to the already complex
extent selection for this specific case makes the code much harder
to follow and reason about. I'd much prefer that we just do
something like:

	if (busy && args->resv == XFS_AG_RESV_AGFL) {
		/*
		 * Extent might have just been freed in this
		 * transaction so we can't use it. Move to the next
		 * best extent candidate and try that instead.
		 */
		<increment/decrement and continue the search loop>
	}

IOWs, we should not be issuing a log force to flush busy extents if
we can't use the largest candidate free extent for the AGFL - we
should just keep searching until we find one we can use....
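
To make the "keep searching" idea concrete, here is a small userspace sketch.
It is an illustration under stated assumptions, not the real allocator: the
by-size btree (cntbt) is replaced by an array sorted by ascending length, a
whole extent is either busy or not (the kernel trims busy ranges out of an
extent instead), and struct toy_free_extent and toy_agfl_pick() are
hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one free-space record in the by-size index. */
struct toy_free_extent {
	unsigned int	len;	/* length in blocks */
	bool		busy;	/* overlaps a busy extent, cannot be reused yet */
};

/*
 * Pick an extent for an AGFL refill from an array sorted by ascending
 * length. Busy extents are skipped rather than waited for. Returns the
 * index of the chosen extent, or -1 if every free extent is busy.
 */
static int
toy_agfl_pick(const struct toy_free_extent *ext, int nr, unsigned int maxlen)
{
	int	first_big = nr;
	int	i;

	/* Locate the first extent that could satisfy maxlen on its own. */
	for (i = 0; i < nr; i++) {
		if (ext[i].len >= maxlen) {
			first_big = i;
			break;
		}
	}

	/* Prefer a non-busy large extent, walking towards larger sizes. */
	for (i = first_big; i < nr; i++) {
		if (!ext[i].busy)
			return i;
	}

	/* Otherwise settle for the largest non-busy smaller extent. */
	for (i = first_big - 1; i >= 0; i--) {
		if (!ext[i].busy)
			return i;
	}

	return -1;
}
```

With extents of length {1, 2, 4, 8} where the last two are busy and maxlen is
4, the sketch returns index 1 (the 2-block extent) without any log force. A
smaller extent is acceptable here because AGFL blocks are inserted one at a
time anyway.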

Cheers,

Dave.

Chandan Babu R April 30, 2021, 1:40 p.m. UTC | #2

On 29 Apr 2021 at 06:42, Dave Chinner wrote:
> On Wed, Apr 28, 2021 at 12:21:52PM +0530, Chandan Babu R wrote:
>> Executing xfs/538 after disabling injection of the bmap_alloc_minlen_extent
>> error can cause several tasks to trigger the hung task timeout. Most of the
>> tasks are blocked on getting a lock on an AG's AGF buffer. However, the task
>> which holds the lock on the AG's AGF buffer has the following call trace:
>>
[..]
> Hmmm. I don't doubt that this fixes the symptom you are seeing, but
> the way it is being fixed doesn't seem right to me at all.
>
> We're trying to populate the AGFL here, and the fact is that a
> multi-block allocation is simply an optimisation to minimise the
> number of extents we need to allocate to fill the AGFL. The extent
> that gets allocated gets broken up into single blocks to be inserted
> into the AGFL, so we don't actually need a contiguous extent to be
> allocated here.
>
> Hence, if the extent we find is busy when allocating for the AGFL,
> we should just skip it and choose another extent. args->minlen is
> set to zero for the allocation, so we can actually return any extent
> that has a length <= args->maxlen. We know this is an AGFL
> allocation because args->resv == XFS_AG_RESV_AGFL, so if we find a
> busy extent that would require a log force to be able to use before
> we can place it in the AGFL, we should just skip it entirely and
> select another extent to allocate from.
>
> Adding another two boolean conditionals to the already complex
> extent selection for this specific case makes the code much harder
> to follow and reason about. I'd much prefer that we just do
> something like:
>
> 	if (busy && args->resv == XFS_AG_RESV_AGFL) {
> 		/*
> 		 * Extent might have just been freed in this
> 		 * transaction so we can't use it. Move to the next
> 		 * best extent candidate and try that instead.
> 		 */
> 		<increment/decrement and continue the search loop>
> 	}
>
> IOWs, we should not be issuing a log force to flush busy extents if
> we can't use the largest candidate free extent for the AGFL - we
> should just keep searching until we find one we can use....

IIUC, the following patch implements the solution that has been suggested
above,

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index aaa19101bb2a..25456dbff767 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -1694,6 +1694,7 @@ xfs_alloc_ag_vextent_size(
 	 * are no smaller extents available.
 	 */
 	if (!i) {
+alloc_small_extent:
 		error = xfs_alloc_ag_vextent_small(args, cnt_cur,
 						   &fbno, &flen, &i);
 		if (error)
@@ -1707,6 +1708,8 @@ xfs_alloc_ag_vextent_size(
 		busy = xfs_alloc_compute_aligned(args, fbno, flen, &rbno,
 				&rlen, &busy_gen);
 	} else {
+		xfs_agblock_t	orig_fbno = NULLAGBLOCK;
+		xfs_extlen_t	orig_flen;
 		/*
 		 * Search for a non-busy extent that is large enough.
 		 */
@@ -1719,6 +1722,11 @@ xfs_alloc_ag_vextent_size(
 				goto error0;
 			}

+			if (orig_fbno == NULLAGBLOCK) {
+				orig_fbno = fbno;
+				orig_flen = flen;
+			}
+
 			busy = xfs_alloc_compute_aligned(args, fbno, flen,
 					&rbno, &rlen, &busy_gen);

@@ -1734,6 +1742,14 @@ xfs_alloc_ag_vextent_size(
 				 * Make it unbusy by forcing the log out and
 				 * retrying.
 				 */
+				if (args->resv == XFS_AG_RESV_AGFL) {
+					error = xfs_alloc_lookup_eq(cnt_cur,
+							orig_fbno, orig_flen, &i);
+					ASSERT(!error && i);
+
+					goto alloc_small_extent;
+				}
+
 				xfs_btree_del_cursor(cnt_cur,
 						     XFS_BTREE_NOERROR);
 				trace_xfs_alloc_size_busy(args);
@@ -1819,7 +1835,7 @@ xfs_alloc_ag_vextent_size(
 	 */
 	args->len = rlen;
 	if (rlen < args->minlen) {
-		if (busy) {
+		if (busy && args->resv != XFS_AG_RESV_AGFL) {
 			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
 			trace_xfs_alloc_size_busy(args);
 			xfs_extent_busy_flush(args->mp, args->pag, busy_gen);

i.e. when we end up at the rightmost edge of the cntbt during allocation of
blocks for refilling the AGFL, the above patch backtracks and continues the
search towards the left edge of the cntbt instead of flushing the CIL. If the
leftmost edge is reached without finding any suitable free extent and the
blocks are being allocated for the AGFL, the function returns to the caller
instead of flushing the CIL and retrying once again.

With the above patch, a workload which consists of:
1. Filling up 90% of the free space of the filesystem.
2. Punching alternate blocks of files.

.. would cause failure when inserting records into either cntbt/bnobt due to
unavailability of AGFL blocks.

This happens because most of the free blocks resulting from punching out
alternate blocks would be residing in the CIL's extent busy list. xfs/538
creates a 1G scratch filesystem and the "punch alternate blocks" workload
creates a little more than 8000 entries in the CIL extent busy list.

So, maybe there is no alternative other than to flush the CIL. To
that end, I have tried to slightly simplify the patch that I had originally
sent (i.e. [PATCH 2/2] xfs: Prevent deadlock when allocating blocks for
AGFL). The new patch removes one of the boolean variables
(i.e. alloc_small_extent) and also skips redundant searching of extent records
when backtracking in preparation for searching smaller extents.

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 7dc50a435cf4..ea01c2674247 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -274,7 +274,8 @@ xfs_alloc_compute_aligned(
 	xfs_extlen_t	foundlen,	/* length in found extent */
 	xfs_agblock_t	*resbno,	/* result block number */
 	xfs_extlen_t	*reslen,	/* result length */
-	unsigned	*busy_gen)
+	unsigned	*busy_gen,
+	bool		*busy_in_trans)
 {
 	xfs_agblock_t	bno = foundbno;
 	xfs_extlen_t	len = foundlen;
@@ -282,7 +283,7 @@ xfs_alloc_compute_aligned(
 	bool		busy;

 	/* Trim busy sections out of found extent */
-	busy = xfs_extent_busy_trim(args, &bno, &len, busy_gen);
+	busy = xfs_extent_busy_trim(args, &bno, &len, busy_gen, busy_in_trans);

 	/*
 	 * If we have a largish extent that happens to start before min_agbno,
@@ -852,7 +853,7 @@ xfs_alloc_cur_check(
 	}

 	busy = xfs_alloc_compute_aligned(args, bno, len, &bnoa, &lena,
-					 &busy_gen);
+					 &busy_gen, NULL);
 	acur->busy |= busy;
 	if (busy)
 		acur->busy_gen = busy_gen;
@@ -1248,7 +1249,7 @@ xfs_alloc_ag_vextent_exact(
 	 */
 	tbno = fbno;
 	tlen = flen;
-	xfs_extent_busy_trim(args, &tbno, &tlen, &busy_gen);
+	xfs_extent_busy_trim(args, &tbno, &tlen, &busy_gen, NULL);

 	/*
 	 * Give up if the start of the extent is busy, or the freespace isn't
@@ -1669,6 +1670,8 @@ xfs_alloc_ag_vextent_size(
 	xfs_extlen_t	rlen;		/* length of returned extent */
 	bool		busy;
 	unsigned	busy_gen;
+	bool		busy_in_trans;
+	bool		all_busy_in_trans;

 restart:
 	/*
@@ -1677,6 +1680,7 @@ xfs_alloc_ag_vextent_size(
 	cnt_cur = xfs_allocbt_init_cursor(args->mp, args->tp, args->agbp,
 		args->agno, XFS_BTNUM_CNT);
 	bno_cur = NULL;
+	all_busy_in_trans = true;
 	busy = false;

 	/*
@@ -1687,13 +1691,15 @@ xfs_alloc_ag_vextent_size(
 		goto error0;

 	/*
-	 * If none then we have to settle for a smaller extent. In the case that
-	 * there are no large extents, this will return the last entry in the
-	 * tree unless the tree is empty. In the case that there are only busy
-	 * large extents, this will return the largest small extent unless there
-	 * are no smaller extents available.
+	 * We have to settle for a smaller extent if there are no maxlen +
+	 * alignment - 1 sized extents or if all larger free extents are still
+	 * in current transaction's busy list. In either case, this will return
+	 * the last entry in the tree unless the tree is empty. In the case that
+	 * there are only busy large extents, this will return the largest small
+	 * extent unless there are no smaller extents available.
 	 */
 	if (!i) {
+alloc_small_extent:
 		error = xfs_alloc_ag_vextent_small(args, cnt_cur,
 						   &fbno, &flen, &i);
 		if (error)
@@ -1705,8 +1711,12 @@ xfs_alloc_ag_vextent_size(
 		}
 		ASSERT(i == 1);
 		busy = xfs_alloc_compute_aligned(args, fbno, flen, &rbno,
-				&rlen, &busy_gen);
+				&rlen, &busy_gen, &busy_in_trans);
+		if (busy && !busy_in_trans)
+			all_busy_in_trans = false;
 	} else {
+		xfs_agblock_t	orig_fbno = NULLAGBLOCK;
+		xfs_extlen_t	orig_flen;
 		/*
 		 * Search for a non-busy extent that is large enough.
 		 */
@@ -1719,27 +1729,52 @@ xfs_alloc_ag_vextent_size(
 				goto error0;
 			}

+			if (orig_fbno == NULLAGBLOCK) {
+				orig_fbno = fbno;
+				orig_flen = flen;
+			}
+
 			busy = xfs_alloc_compute_aligned(args, fbno, flen,
-					&rbno, &rlen, &busy_gen);
+					&rbno, &rlen, &busy_gen,
+					&busy_in_trans);

 			if (rlen >= args->maxlen)
 				break;

+			if (busy && !busy_in_trans)
+				all_busy_in_trans = false;
+
 			error = xfs_btree_increment(cnt_cur, 0, &i);
 			if (error)
 				goto error0;
 			if (i == 0) {
+				if (!all_busy_in_trans) {
+					/*
+					 * Our only valid extents must have been busy.
+					 * Make it unbusy by forcing the log out and
+					 * retrying.
+					 */
+					xfs_btree_del_cursor(cnt_cur,
+							XFS_BTREE_NOERROR);
+					trace_xfs_alloc_size_busy(args);
+					xfs_extent_busy_flush(args->mp,
+							args->pag, busy_gen);
+					goto restart;
+				}
+
 				/*
-				 * Our only valid extents must have been busy.
-				 * Make it unbusy by forcing the log out and
-				 * retrying.
+				 * All the large free extents are busy in the
+				 * current transaction's tp->t_busy list.  Hence
+				 * forcing the log will be futile and will also
+				 * leave the current task waiting
+				 * indefinitely. Hence try to allocate smaller
+				 * extents.
 				 */
-				xfs_btree_del_cursor(cnt_cur,
-						     XFS_BTREE_NOERROR);
-				trace_xfs_alloc_size_busy(args);
-				xfs_extent_busy_flush(args->mp,
-							args->pag, busy_gen);
-				goto restart;
+				error = xfs_alloc_lookup_eq(cnt_cur, orig_fbno,
+						orig_flen, &i);
+				ASSERT(!error && i);
+
+				goto alloc_small_extent;
 			}
 		}
 	}
@@ -1783,7 +1818,10 @@ xfs_alloc_ag_vextent_size(
 			if (flen < bestrlen)
 				break;
 			busy = xfs_alloc_compute_aligned(args, fbno, flen,
-					&rbno, &rlen, &busy_gen);
+					&rbno, &rlen, &busy_gen,
+					&busy_in_trans);
+			if (busy && !busy_in_trans)
+				all_busy_in_trans = false;
 			rlen = XFS_EXTLEN_MIN(args->maxlen, rlen);
 			if (XFS_IS_CORRUPT(args->mp,
 					   rlen != 0 &&
@@ -1819,7 +1857,7 @@ xfs_alloc_ag_vextent_size(
 	 */
 	args->len = rlen;
 	if (rlen < args->minlen) {
-		if (busy) {
+		if (busy && !all_busy_in_trans) {
 			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
 			trace_xfs_alloc_size_busy(args);
 			xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
index a4075685d9eb..16ba514f9e81 100644
--- a/fs/xfs/xfs_extent_busy.c
+++ b/fs/xfs/xfs_extent_busy.c
@@ -334,7 +334,8 @@ xfs_extent_busy_trim(
 	struct xfs_alloc_arg	*args,
 	xfs_agblock_t		*bno,
 	xfs_extlen_t		*len,
-	unsigned		*busy_gen)
+	unsigned		*busy_gen,
+	bool			*busy_in_trans)
 {
 	xfs_agblock_t		fbno;
 	xfs_extlen_t		flen;
@@ -362,6 +363,9 @@ xfs_extent_busy_trim(
 			continue;
 		}

+		if (busy_in_trans)
+			*busy_in_trans = busyp->flags & XFS_EXTENT_BUSY_IN_TRANS;
+
 		if (bbno <= fbno) {
 			/* start overlap */

diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
index 929f72d1c699..dcdc70821622 100644
--- a/fs/xfs/xfs_extent_busy.h
+++ b/fs/xfs/xfs_extent_busy.h
@@ -49,7 +49,7 @@ xfs_extent_busy_reuse(struct xfs_mount *mp, xfs_agnumber_t agno,

 bool
 xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno,
-		xfs_extlen_t *len, unsigned *busy_gen);
+		xfs_extlen_t *len, unsigned *busy_gen, bool *busy_in_trans);

 void
 xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag,

Please let me know your views on this.

--
chandan

Dave Chinner April 30, 2021, 10:44 p.m. UTC | #3

On Fri, Apr 30, 2021 at 07:10:31PM +0530, Chandan Babu R wrote:
> On 29 Apr 2021 at 06:42, Dave Chinner wrote:
> > On Wed, Apr 28, 2021 at 12:21:52PM +0530, Chandan Babu R wrote:
> >> Executing xfs/538 after disabling injection of the bmap_alloc_minlen_extent
> >> error can cause several tasks to trigger the hung task timeout. Most of the
> >> tasks are blocked on getting a lock on an AG's AGF buffer. However, the task
> >> which holds the lock on the AG's AGF buffer has the following call trace:
> >>
> [..]
> > Hmmm. I don't doubt that this fixes the symptom you are seeing, but
> > the way it is being fixed doesn't seem right to me at all.
> >
> > We're trying to populate the AGFL here, and the fact is that a
> > multi-block allocation is simply an optimisation to minimise the
> > number of extents we need to allocate to fill the AGFL. The extent
> > that gets allocated gets broken up into single blocks to be inserted
> > into the AGFL, so we don't actually need a contiguous extent to be
> > allocated here.
> >
> > Hence, if the extent we find is busy when allocating for the AGFL,
> > we should just skip it and choose another extent. args->minlen is
> > set to zero for the allocation, so we can actually return any extent
> > that has a length <= args->maxlen. We know this is an AGFL
> > allocation because args->resv == XFS_AG_RESV_AGFL, so if we find a
> > busy extent that would require a log force to be able to use before
> > we can place it in the AGFL, we should just skip it entirely and
> > select another extent to allocate from.
> >
> > Adding another two boolean conditionals to the already complex
> > extent selection for this specific case makes the code much harder
> > to follow and reason about. I'd much prefer that we just do
> > something like:
> >
> > 	if (busy && args->resv == XFS_AG_RESV_AGFL) {
> > 		/*
> > 		 * Extent might have just been freed in this
> > 		 * transaction so we can't use it. Move to the next
> > 		 * best extent candidate and try that instead.
> > 		 */
> > 		<increment/decrement and continue the search loop>
> > 	}
> >
> > IOWs, we should not be issuing a log force to flush busy extents if
> > we can't use the largest candidate free extent for the AGFL - we
> > should just keep searching until we find one we can use....
> 
> IIUC, the following patch implements the solution that has been suggested
> above,
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index aaa19101bb2a..25456dbff767 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -1694,6 +1694,7 @@ xfs_alloc_ag_vextent_size(
>  	 * are no smaller extents available.
>  	 */
>  	if (!i) {
> +alloc_small_extent:
>  		error = xfs_alloc_ag_vextent_small(args, cnt_cur,
>  						   &fbno, &flen, &i);
>  		if (error)
> @@ -1707,6 +1708,8 @@ xfs_alloc_ag_vextent_size(
>  		busy = xfs_alloc_compute_aligned(args, fbno, flen, &rbno,
>  				&rlen, &busy_gen);
>  	} else {
> +		xfs_agblock_t	orig_fbno = NULLAGBLOCK;
> +		xfs_extlen_t	orig_flen;
>  		/*
>  		 * Search for a non-busy extent that is large enough.
>  		 */
> @@ -1719,6 +1722,11 @@ xfs_alloc_ag_vextent_size(
>  				goto error0;
>  			}
> 
> +			if (orig_fbno == NULLAGBLOCK) {
> +				orig_fbno = fbno;
> +				orig_flen = flen;
> +			}
> +
>  			busy = xfs_alloc_compute_aligned(args, fbno, flen,
>  					&rbno, &rlen, &busy_gen);
> 
> @@ -1734,6 +1742,14 @@ xfs_alloc_ag_vextent_size(
>  				 * Make it unbusy by forcing the log out and
>  				 * retrying.
>  				 */
> +				if (args->resv == XFS_AG_RESV_AGFL) {
> +					error = xfs_alloc_lookup_eq(cnt_cur,
> +							orig_fbno, orig_flen, &i);
> +					ASSERT(!error && i);
> +
> +					goto alloc_small_extent;
> +				}
> +
>  				xfs_btree_del_cursor(cnt_cur,
>  						     XFS_BTREE_NOERROR);
>  				trace_xfs_alloc_size_busy(args);
> @@ -1819,7 +1835,7 @@ xfs_alloc_ag_vextent_size(
>  	 */
>  	args->len = rlen;
>  	if (rlen < args->minlen) {
> -		if (busy) {
> +		if (busy && args->resv != XFS_AG_RESV_AGFL) {
>  			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
>  			trace_xfs_alloc_size_busy(args);
>  			xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
> 
> i.e. when we end up at the rightmost edge of the cntbt during allocation of
> blocks for refilling the AGFL, the above patch backtracks and continues the
> search towards the left edge of the cntbt instead of flushing the CIL. If the
> leftmost edge is reached without finding any suitable free extent and the
> blocks are being allocated for the AGFL, the function returns to the caller
> instead of flushing the CIL and retrying once again.

At which point, we know that all the free extents in that AG are
either busy or we are truly out of space. Hence if this search
fails, it makes sense to call xfs_extent_busy_flush() to wait for
all the busy extents in the AG to complete their processing before
trying again.

> With the above patch, a workload which consists of:
> 1. Filling up 90% of the free space of the filesystem.
> 2. Punching alternate blocks of files.
> 
> .. would cause failure when inserting records into either cntbt/bnobt due to
> unavailability of AGFL blocks.
> 
> This happens because most of the free blocks resulting from punching out
> alternate blocks would be residing in the CIL's extent busy list. xfs/538
> creates a 1G scratch filesystem and the "punch alternate blocks" workload
> creates a little more than 8000 entries in the CIL extent busy list.

Seems like you broke the existing handling of this situation by
preventing the AGFL filling code from flushing the busy extents when
all the AG can find is busy extents.

> So, maybe there is no alternative other than to flush the CIL. To

Sure, I never suggested that we completely elide log forces. What I
said is that we -shouldn't immediately resort to a log force- because
the first maxlen extent match we come across is busy and can't
immediately be reused.

That is, the code still needs to call xfs_extent_busy_flush() and
try the allocation again, it just needs to do it when no candidate
extent can be found instead of after the first candidate is found to
be busy.

> that end, I have tried to slightly simplify the patch that I had originally
> sent (i.e. [PATCH 2/2] xfs: Prevent deadlock when allocating blocks for
> AGFL). The new patch removes one of the boolean variables
> (i.e. alloc_small_extent) and also skips redundant searching of extent records
> when backtracking in preparation for searching smaller extents.

I still don't think this is the right approach because it tries to
correct a bad decision (use a busy extent instead of trying the next
free extent) with another bad decision (log force might not unbusy
the extent we are trying to allocate). We should not do either of
these things in this situation, nor do we need to mark busy extents
as being in a transaction to avoid deadlocks.

That is, if all free extents are busy and there is nothing we can
allocate in the AG for the AGFL, then flush the busy extents and try
again while we hold the AGF locked. Because we hold the AGF locked,
nobody else can create new busy extents in the AG while we wait.
That means after a busy extent flush any remaining busy extents in
this AG are ones that we hold busy in this transaction and are the
ones we need to avoid allocating from in the first place.

IOWs, we don't need to mark busy extents as being in a transaction
at all - we know that this is the only way we can have a busy extent
in the AG after we flush busy extents while holding the AGF locked.
And that means if we still can't find a free extent after a busy
extent flush, then we're definitely at ENOSPC in that AG as there
are no free extents we can safely allocate from in the AG....
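
The ordering argued for above (search first, flush busy extents only when no
candidate exists, then treat a second failure as ENOSPC) can be sketched in
userspace C. This is a toy model under stated assumptions rather than a
proposed patch: the free extents are an array sorted by ascending length, the
AGF lock is implied rather than modelled, a flush is assumed to clear every
busy extent that is not owned by the current transaction, and struct
toy_extent, toy_find_usable(), toy_busy_flush() and toy_agfl_alloc() are
hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a free extent and its busy state. */
struct toy_extent {
	unsigned int	len;		/* length in blocks */
	bool		busy;		/* cannot be reused yet */
	bool		in_cur_trans;	/* made busy by the current transaction */
};

/* Return the index of the largest non-busy extent, or -1 if none exists. */
static int
toy_find_usable(const struct toy_extent *ext, int nr)
{
	int	i;

	for (i = nr - 1; i >= 0; i--) {
		if (!ext[i].busy)
			return i;
	}
	return -1;
}

/*
 * Toy busy-extent flush: everything that other, already committed
 * transactions made busy becomes reusable. Extents made busy by the
 * current transaction are the only ones that can remain busy.
 */
static void
toy_busy_flush(struct toy_extent *ext, int nr)
{
	int	i;

	for (i = 0; i < nr; i++) {
		if (ext[i].busy && !ext[i].in_cur_trans)
			ext[i].busy = false;
	}
}

/*
 * Search, flush once only if nothing usable was found, then search again.
 * Returns an extent index on success or -ENOSPC. *flushes reports how many
 * flushes were needed.
 */
static int
toy_agfl_alloc(struct toy_extent *ext, int nr, int *flushes)
{
	int	idx;

	*flushes = 0;
	idx = toy_find_usable(ext, nr);
	if (idx >= 0)
		return idx;

	toy_busy_flush(ext, nr);
	(*flushes)++;

	/* Whatever is still busy now belongs to this transaction. */
	idx = toy_find_usable(ext, nr);
	return idx >= 0 ? idx : -ENOSPC;
}
```

In this model no per-extent "in transaction" flag is consulted by the
allocation path itself. The in_cur_trans field exists only so that the toy
flush can decide what it is able to clear, which mirrors the point that a
flush performed with the AGF locked leaves behind only the caller's own busy
extents.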

Cheers,

Dave.

Chandan Babu R May 3, 2021, 9:52 a.m. UTC | #4

On 01 May 2021 at 04:14, Dave Chinner wrote:
> On Fri, Apr 30, 2021 at 07:10:31PM +0530, Chandan Babu R wrote:
>> On 29 Apr 2021 at 06:42, Dave Chinner wrote:
>> > On Wed, Apr 28, 2021 at 12:21:52PM +0530, Chandan Babu R wrote:
>> >> Executing xfs/538 after disabling injection of the bmap_alloc_minlen_extent
>> >> error can cause several tasks to trigger the hung task timeout. Most of the
>> >> tasks are blocked on getting a lock on an AG's AGF buffer. However, the task
>> >> which holds the lock on the AG's AGF buffer has the following call trace:
>> >>
>> [..]
>> > Hmmm. I don't doubt that this fixes the symptom you are seeing, but
>> > the way it is being fixed doesn't seem right to me at all.
>> >
>> > We're trying to populate the AGFL here, and the fact is that a
>> > multi-block allocation is simply an optimisation to minimise the
>> > number of extents we need to allocate to fill the AGFL. The extent
>> > that gets allocated gets broken up into single blocks to be inserted
>> > into the AGFL, so we don't actually need a contiguous extent to be
>> > allocated here.
>> >
>> > Hence, if the extent we find is busy when allocating for the AGFL,
>> > we should just skip it and choose another extent. args->minlen is
>> > set to zero for the allocation, so we can actually return any extent
>> > that has a length <= args->maxlen. We know this is an AGFL
>> > allocation because args->resv == XFS_AG_RESV_AGFL, so if we find a
>> > busy extent that would require a log force to be able to use before
>> > we can place it in the AGFL, we should just skip it entirely and
>> > select another extent to allocate from.
>> >
>> > Adding another two boolean conditionals to the already complex
>> > extent selection for this specific case makes the code much harder
>> > to follow and reason about. I'd much prefer that we just do
>> > something like:
>> >
>> > 	if (busy && args->resv == XFS_AG_RESV_AGFL) {
>> > 		/*
>> > 		 * Extent might have just been freed in this
>> > 		 * transaction so we can't use it. Move to the next
>> > 		 * best extent candidate and try that instead.
>> > 		 */
>> > 		<increment/decrement and continue the search loop>
>> > 	}
>> >
>> > IOWs, we should not be issuing a log force to flush busy extents if
>> > we can't use the largest candidate free extent for the AGFL - we
>> > should just keep searching until we find one we can use....
>>
>> IIUC, the following patch implements the solution that has been suggested
>> above,
>>
>> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
>> index aaa19101bb2a..25456dbff767 100644
>> --- a/fs/xfs/libxfs/xfs_alloc.c
>> +++ b/fs/xfs/libxfs/xfs_alloc.c
>> @@ -1694,6 +1694,7 @@ xfs_alloc_ag_vextent_size(
>>  	 * are no smaller extents available.
>>  	 */
>>  	if (!i) {
>> +alloc_small_extent:
>>  		error = xfs_alloc_ag_vextent_small(args, cnt_cur,
>>  						   &fbno, &flen, &i);
>>  		if (error)
>> @@ -1707,6 +1708,8 @@ xfs_alloc_ag_vextent_size(
>>  		busy = xfs_alloc_compute_aligned(args, fbno, flen, &rbno,
>>  				&rlen, &busy_gen);
>>  	} else {
>> +		xfs_agblock_t	orig_fbno = NULLAGBLOCK;
>> +		xfs_extlen_t	orig_flen;
>>  		/*
>>  		 * Search for a non-busy extent that is large enough.
>>  		 */
>> @@ -1719,6 +1722,11 @@ xfs_alloc_ag_vextent_size(
>>  				goto error0;
>>  			}
>>
>> +			if (orig_fbno == NULLAGBLOCK) {
>> +				orig_fbno = fbno;
>> +				orig_flen = flen;
>> +			}
>> +
>>  			busy = xfs_alloc_compute_aligned(args, fbno, flen,
>>  					&rbno, &rlen, &busy_gen);
>>
>> @@ -1734,6 +1742,14 @@ xfs_alloc_ag_vextent_size(
>>  				 * Make it unbusy by forcing the log out and
>>  				 * retrying.
>>  				 */
>> +				if (args->resv == XFS_AG_RESV_AGFL) {
>> +					error = xfs_alloc_lookup_eq(cnt_cur,
>> +							orig_fbno, orig_flen, &i);
>> +					ASSERT(!error && i);
>> +
>> +					goto alloc_small_extent;
>> +				}
>> +
>>  				xfs_btree_del_cursor(cnt_cur,
>>  						     XFS_BTREE_NOERROR);
>>  				trace_xfs_alloc_size_busy(args);
>> @@ -1819,7 +1835,7 @@ xfs_alloc_ag_vextent_size(
>>  	 */
>>  	args->len = rlen;
>>  	if (rlen < args->minlen) {
>> -		if (busy) {
>> +		if (busy && args->resv != XFS_AG_RESV_AGFL) {
>>  			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
>>  			trace_xfs_alloc_size_busy(args);
>>  			xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
>>
>> i.e.  when we end up at the right most edge of the cntbt during allocation of
>> blocks for refilling AGFL, the above patch backtracks and continues search
>> towards the left edge of the cntbt instead of flushing the CIL. If the
>> leftmost edge is reached without finding any suitable free extent and the
>> blocks are being allocated for AGFL, the function returns back to the caller
>> instead of flushing the CIL and retrying once again.
>
> At which point, we know that all the free extents in that AG are
> either busy or we are truly out of space. Hence if this search
> fails, it makes sense to call xfs_extent_busy_flush() to wait for
> all the busy extents in the AG to complete their processing before
> trying again.
>
>> With the above patch, a workload which consists of,
>> 1. Filling up 90% of the free space of the filesystem.
>> 2. Punch alternate blocks of files.
>>
>> .. would cause failure when inserting records into either cntbt/bnobt due to
>> unavailability of AGFL blocks.
>>
>> This happens because most of the free blocks resulting from punching out
>> alternate blocks would be residing in the CIL's extent busy list. xfs/538
>> creates 1G sized scratch filesystem and the "punch alternate blocks" workload
>> creates a little more than 8000 entries in the CIL extent busy list.
>
> Seems like you broke the existing handling of this situation by
> preventing the AGFL filling code from flushing the busy extents when
> all the AG can find is busy extents.
>
>> So, may be there are no other alternatives other than to flush the CIL. To
>
> Sure, I never suggested that we completely elide log forces. What I
> said is that we -shouldn't immediately resort to a log force- because
> the first maxlen extent match we come across is busy and can't
> immediately be reused.
>
> That is, the code still needs to call xfs_extent_busy_flush() and
> try the allocation again, it just needs to do it when no candidate
> extent can be found instead of after the first candidate is found to
> be busy.

You are right. When allocating blocks to replenish the AGFL, if searching for
free extents whose length is >= xfs_alloc_args->maxlen yields only busy
extents, we should backtrack and search for extents of smaller length, since
the AGFL does not need its blocks to be contiguous. If the search among the
smaller extents again yields only busy extents, we should invoke
xfs_extent_busy_flush() to get the corresponding extents marked as unbusy and
then restart the search. However, AFAICT there is one nit ...

>
>> that end, I have tried to slightly simplify the patch that I had originally
>> sent (i.e. [PATCH 2/2] xfs: Prevent deadlock when allocating blocks for
>> AGFL). The new patch removes one the boolean variables
>> (i.e. alloc_small_extent) and also skips redundant searching of extent records
>> when backtracking in preparation for searching smaller extents.
>
> I still don't think this is right approach because it tries to
> correct a bad decision (use a busy extent instead of trying the next
> free extent) with another bad decision (log force might not unbusy
> the extent we are trying to allocate). We should not do either of
> these things in this situation, nor do we need to mark busy extents
> as being in a transaction to avoid deadlocks.
>
> That is, if all free extents are busy and there is nothing we can
> allocate in the AG for the AGFL, then flush the busy extents and try
> again while we hold the AGF locked. Because we hold the AGF locked,
> nobody else can create new busy extents in the AG while we wait.
> That means after a busy extent flush any remaining busy extents in
> this AG are ones that we hold busy in this transaction and are the
> ones we need to avoid allocating from in the first place.
>
> IOWs, we don't need to mark busy extents as being in a transaction
> at all - we know that this is the only way we can have a busy extent
> in the AG after we flush busy extents while holding the AGF locked.
> And that means if we still can't find a free extent after a busy
> extent flush, then we're definitely at ENOSPC in that AG as there
> are no free extents we can safely allocate from in the AG....

... Assume that there is one free busy extent in an AG and that it is 1 block
in length. Also assume that the free extent is busy in the current
transaction.

An extent free operation corresponding to the 2nd xfs_extent_free_item will
invoke xfs_alloc_ag_vextent_size() with the following results,

1. Search starts at the only free extent and proceeds towards the leftmost
   edge of the cntbt.

2. Since there is only one free extent and it is also busy, we now invoke
   xfs_extent_busy_flush().

3. xfs_extent_busy_flush() flushes the CIL and waits for the "busy generation"
   number to change. This event will never occur since the only free extent is
   busy in the current transaction. Hence the task will now wait indefinitely.

PS: I am not 100% sure if the above mentioned scenario (i.e. having only one
extent free in an AG and also it being marked as busy) is actually
possible. After going through the corresponding source code, I could not find
any evidence to the contrary.
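
The hang condition described in steps 1-3 above can be sketched as a small
userspace model. This is only an illustration: struct model_extent, the
owner_tid field and model_flush_would_hang() are hypothetical names, not
kernel code, and the model assumes that a log force can never unbusy an
extent that was freed by the still-running transaction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cntbt record plus its busy state. */
struct model_extent {
	unsigned int	len;		/* length in blocks */
	bool		busy;		/* present in the busy extent list */
	int		owner_tid;	/* transaction that freed it, -1 if none */
};

/*
 * Return true if a blocking flush issued by transaction @cur_tid could
 * never make progress: at least one free extent exists, all of them are
 * busy, and every one was freed by @cur_tid itself.
 */
static bool
model_flush_would_hang(const struct model_extent *recs, size_t nr, int cur_tid)
{
	bool	saw_busy = false;
	size_t	i;

	assert(recs != NULL || nr == 0);

	for (i = 0; i < nr; i++) {
		if (!recs[i].busy)
			return false;	/* usable extent, no flush needed */
		saw_busy = true;
		if (recs[i].owner_tid != cur_tid)
			return false;	/* a log force can unbusy this one */
	}
	return saw_busy;
}
```

With a single 1 block extent that is busy in the current transaction, the
model reports a hang; if any extent is either not busy or busy in a different
transaction, it does not.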

--
chandan

Dave Chinner May 4, 2021, 12:03 a.m. UTC | #5

On Mon, May 03, 2021 at 03:22:10PM +0530, Chandan Babu R wrote:
> On 01 May 2021 at 04:14, Dave Chinner wrote:
> >> that end, I have tried to slightly simplify the patch that I had originally
> >> sent (i.e. [PATCH 2/2] xfs: Prevent deadlock when allocating blocks for
> >> AGFL). The new patch removes one the boolean variables
> >> (i.e. alloc_small_extent) and also skips redundant searching of extent records
> >> when backtracking in preparation for searching smaller extents.
> >
> > I still don't think this is right approach because it tries to
> > correct a bad decision (use a busy extent instead of trying the next
> > free extent) with another bad decision (log force might not unbusy
> > the extent we are trying to allocate). We should not do either of
> > these things in this situation, nor do we need to mark busy extents
> > as being in a transaction to avoid deadlocks.
> >
> > That is, if all free extents are busy and there is nothing we can
> > allocate in the AG for the AGFL, then flush the busy extents and try
> > again while we hold the AGF locked. Because we hold the AGF locked,
> > nobody else can create new busy extents in the AG while we wait.
> > That means after a busy extent flush any remaining busy extents in
> > this AG are ones that we hold busy in this transaction and are the
> > ones we need to avoid allocating from in the first place.
> >
> > IOWs, we don't need to mark busy extents as being in a transaction
> > at all - we know that this is the only way we can have a busy extent
> > in the AG after we flush busy extents while holding the AGF locked.
> > And that means if we still can't find a free extent after a busy
> > extent flush, then we're definitely at ENOSPC in that AG as there
> > are no free extents we can safely allocate from in the AG....
> 
> ... Assume that there is one free busy extent in an AG and that it is 1 block
> in length. Also assume that the free extent is busy in the current
> transaction.

ISTR that this won't happen during extent allocation because the
transaction reservation and the AG selection code is supposed to
ensure there are sufficient free blocks both globally and in the AG
for the entire operation, not just one part of it.

Also, the extent freeing path is this:

...
  __xfs_free_extent()
    xfs_free_extent_fix_freelist()
      xfs_alloc_fix_freelist(XFS_ALLOC_FLAG_FREEING)

And that XFS_ALLOC_FLAG_FREEING is special - it means that we:

a) always say there is space available in the AG for the freeing
operation to take place, and
b) only perform best effort allocation to fill up the free list.

Case b) triggers this code:

                /*
                 * Stop if we run out.  Won't happen if callers are obeying
                 * the restrictions correctly.  Can happen for free calls
                 * on a completely full ag.
                 */
                if (targs.agbno == NULLAGBLOCK) {
                        if (flags & XFS_ALLOC_FLAG_FREEING)
                                break;
                        goto out_agflbp_relse;
                }

That is, if we fail to fix up the free list, we still go ahead with
the operation because freeing extents when we are at ENOSPC means
that, by definition, we don't need to allocate blocks to track the
new free space because the new free space records will fit inside
the root btree blocks that are already allocated.

Hence when doing allocation for the free list, we need to fail the
allocation rather than block on the only remaining free extent in
the AG. If we are freeing extents, the AGFL not being full is not an
issue at all. And if we are allocating extents, the transaction
reservations should have ensured that the AG had sufficient space in
it to complete the entire operation without deadlocking waiting for
space.

Either way, I don't see a problem with making sure the AGFL
allocations just skip busy extents and fail if the only free extents
are ones this transaction has freed itself.

Cheers,

Dave.

Chandan Babu R May 5, 2021, 12:42 p.m. UTC | #6

On 04 May 2021 at 05:33, Dave Chinner wrote:
> On Mon, May 03, 2021 at 03:22:10PM +0530, Chandan Babu R wrote:
>> On 01 May 2021 at 04:14, Dave Chinner wrote:
>> >> that end, I have tried to slightly simplify the patch that I had originally
>> >> sent (i.e. [PATCH 2/2] xfs: Prevent deadlock when allocating blocks for
>> >> AGFL). The new patch removes one the boolean variables
>> >> (i.e. alloc_small_extent) and also skips redundant searching of extent records
>> >> when backtracking in preparation for searching smaller extents.
>> >
>> > I still don't think this is right approach because it tries to
>> > correct a bad decision (use a busy extent instead of trying the next
>> > free extent) with another bad decision (log force might not unbusy
>> > the extent we are trying to allocate). We should not do either of
>> > these things in this situation, nor do we need to mark busy extents
>> > as being in a transaction to avoid deadlocks.
>> >
>> > That is, if all free extents are busy and there is nothing we can
>> > allocate in the AG for the AGFL, then flush the busy extents and try
>> > again while we hold the AGF locked. Because we hold the AGF locked,
>> > nobody else can create new busy extents in the AG while we wait.
>> > That means after a busy extent flush any remaining busy extents in
>> > this AG are ones that we hold busy in this transaction and are the
>> > ones we need to avoid allocating from in the first place.
>> >
>> > IOWs, we don't need to mark busy extents as being in a transaction
>> > at all - we know that this is the only way we can have a busy extent
>> > in the AG after we flush busy extents while holding the AGF locked.
>> > And that means if we still can't find a free extent after a busy
>> > extent flush, then we're definitely at ENOSPC in that AG as there
>> > are no free extents we can safely allocate from in the AG....
>>
>> ... Assume that there is one free busy extent in an AG and that it is 1 block
>> in length. Also assume that the free extent is busy in the current
>> transaction.
>
> ISTR that this won't happen during extent allocation because the
> transaction reservation and the AG selection code is supposed to
> ensure there are sufficient free blocks both globally and in the AG
> for the entire operation, not just one part of it.
>
> Also, the extent freeing path is this:
>
> ...
>   __xfs_free_extent()
>     xfs_free_extent_fix_freelist()
>       xfs_alloc_fix_freelist(XFS_ALLOC_FLAG_FREEING)
>
> And that XFS_ALLOC_FLAG_FREEING is special - it means that we:
>
> a) always say there is space available in the AG for the freeing
> operation to take place, and
> b) only perform best effort allocation to fill up the free list.
>
> Case b) triggers this code:
>
>                 /*
>                  * Stop if we run out.  Won't happen if callers are obeying
>                  * the restrictions correctly.  Can happen for free calls
>                  * on a completely full ag.
>                  */
>                 if (targs.agbno == NULLAGBLOCK) {
>                         if (flags & XFS_ALLOC_FLAG_FREEING)
>                                 break;
>                         goto out_agflbp_relse;
>                 }
>
>
> That is, if we fail to fix up the free list, we still go ahead with
> the operation because freeing extents when we are at ENOSPC means
> that, by definition, we don't need to allocate blocks to track the
> new free space because the new free space records will fit inside
> the root btree blocks that are already allocated.
>
> Hence when doing allocation for the free list, we need to fail the
> allocation rather than block on the only remaining free extent in
> the AG. If we are freeing extents, the AGFL not being full is not an
> issue at all. And if we are allocating extents, the transaction
> reservations should have ensured that the AG had sufficient space in
> it to complete the entire operation without deadlocking waiting for
> space.
>
> Either way, I don't see a problem with making sure the AGFL
> allocations just skip busy extents and fail if the only free extents
> are ones this transaction has freed itself.
>

Hmm. In the scenario where *all* free extents in the AG were originally freed
by the current transaction (and hence busy in the transaction), we would need
to be able to recognize this situation and skip invoking
xfs_extent_busy_flush() altogether. Otherwise, xfs_extent_busy_flush() invokes
xfs_log_force() and keeps waiting for busy generation number to change.

Hence, IMHO we would need an extent busy flag (e.g. XFS_EXTENT_BUSY_IN_TRANS)
to correctly determine if all the busy extents are indeed busy in the current
transaction.

--
chandan

Dave Chinner May 6, 2021, 3:27 a.m. UTC | #7

On Wed, May 05, 2021 at 06:12:41PM +0530, Chandan Babu R wrote:
> > Hence when doing allocation for the free list, we need to fail the
> > allocation rather than block on the only remaining free extent in
> > the AG. If we are freeing extents, the AGFL not being full is not an
> > issue at all. And if we are allocating extents, the transaction
> > reservations should have ensured that the AG had sufficient space in
> > it to complete the entire operation without deadlocking waiting for
> > space.
> >
> > Either way, I don't see a problem with making sure the AGFL
> > allocations just skip busy extents and fail if the only free extents
> > are ones this transaction has freed itself.
> >
> 
> Hmm. In the scenario where *all* free extents in the AG were originally freed
> by the current transaction (and hence busy in the transaction),

How does that happen? 

> we would need
> to be able to recognize this situation and skip invoking
> xfs_extent_busy_flush() altogether.

If we are freeing extents (i.e. XFS_ALLOC_FLAG_FREEING is set) and
we are doing allocation for AGFL and we only found busy extents,
then it's OK to fail the allocation.

We have options here - once we get to the end of the btree and
haven't found a candidate that isn't busy, we could fail
immediately. Or maybe we try an optimistic flush which forces the
log and waits for a short while (instead of forever) for the
generation to change and then fail if we get a timeout response. Or
maybe there's a more elegant way of doing this that hasn't yet
rattled out of my poor, overloaded brain right now.

Just because we currently do a blocking flush doesn't mean we always
must do a blocking flush....

Cheers,

Dave.

Chandan Babu R May 11, 2021, 11:49 a.m. UTC | #8

On 06 May 2021 at 08:57, Dave Chinner wrote:
> On Wed, May 05, 2021 at 06:12:41PM +0530, Chandan Babu R wrote:
>> > Hence when doing allocation for the free list, we need to fail the
>> > allocation rather than block on the only remaining free extent in
>> > the AG. If we are freeing extents, the AGFL not being full is not an
>> > issue at all. And if we are allocating extents, the transaction
>> > reservations should have ensured that the AG had sufficient space in
>> > it to complete the entire operation without deadlocking waiting for
>> > space.
>> >
>> > Either way, I don't see a problem with making sure the AGFL
>> > allocations just skip busy extents and fail if the only free extents
>> > are ones this transaction has freed itself.
>> >
>>
>> Hmm. In the scenario where *all* free extents in the AG were originally freed
>> by the current transaction (and hence busy in the transaction),
>
> How does that happen?

I tried in vain to arrive at the above mentioned scenario by consuming away as
many blocks as possible from the filesystem. At best, I could arrive at an AG
with just one free extent record in the cntbt (NOTE: I had to disable global
reservation by invoking "xfs_io -x -c 'resblks 0' $mntpnt"):

recs[1] = [startblock,blockcount]
1:[32767,1]

For each AG available in an FS instance, we take away 8
(i.e. XFS_ALLOC_AGFL_RESERVE + 4) blocks from the global free data blocks
counter. This reservation is applied to the FS as a whole rather than each AG
individually. Hence we could get to a scenario where an AG could have fewer
than 8 free blocks. I could not find any other restriction in the code that
explicitly prevents an AG from having zero free extents.

However, I could not create such an AG because any fs operation that needs
extent allocation to be done would try to reserve more than 1 extent causing
the above cited AG to not be chosen.

>
>> we would need
>> to be able to recognize this situation and skip invoking
>> xfs_extent_busy_flush() altogether.
>
> If we are freeing extents (i.e XFS_ALLOC_FLAG_FREEING is set) and
> we are doing allocation for AGFL and we only found busy extents,
> then it's OK to fail the allocation.

When freeing an extent, the following patch skips allocation of blocks to AGFL
if all the free extents found are busy,

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index aaa19101bb2a..5310e311d5c6 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -1694,6 +1694,7 @@ xfs_alloc_ag_vextent_size(
 	 * are no smaller extents available.
 	 */
 	if (!i) {
+alloc_small_extent:
 		error = xfs_alloc_ag_vextent_small(args, cnt_cur,
 						   &fbno, &flen, &i);
 		if (error)
@@ -1710,6 +1711,9 @@ xfs_alloc_ag_vextent_size(
 		/*
 		 * Search for a non-busy extent that is large enough.
 		 */
+		xfs_agblock_t	orig_fbno = NULLAGBLOCK;
+		xfs_extlen_t	orig_flen;
+
 		for (;;) {
 			error = xfs_alloc_get_rec(cnt_cur, &fbno, &flen, &i);
 			if (error)
@@ -1719,6 +1723,11 @@ xfs_alloc_ag_vextent_size(
 				goto error0;
 			}

+			if (orig_fbno == NULLAGBLOCK) {
+				orig_fbno = fbno;
+				orig_flen = flen;
+			}
+
 			busy = xfs_alloc_compute_aligned(args, fbno, flen,
 					&rbno, &rlen, &busy_gen);

@@ -1729,6 +1738,13 @@ xfs_alloc_ag_vextent_size(
 			if (error)
 				goto error0;
 			if (i == 0) {
+				if (args->freeing_extent) {
+					error = xfs_alloc_lookup_eq(cnt_cur,
+							orig_fbno, orig_flen, &i);
+					ASSERT(!error && i);
+					goto alloc_small_extent;
+				}
+
 				/*
 				 * Our only valid extents must have been busy.
 				 * Make it unbusy by forcing the log out and
@@ -1819,7 +1835,7 @@ xfs_alloc_ag_vextent_size(
 	 */
 	args->len = rlen;
 	if (rlen < args->minlen) {
-		if (busy) {
+		if (busy && !args->freeing_extent) {
 			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
 			trace_xfs_alloc_size_busy(args);
 			xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
@@ -2641,6 +2657,7 @@ xfs_alloc_fix_freelist(
 	targs.alignment = targs.minlen = targs.prod = 1;
 	targs.type = XFS_ALLOCTYPE_THIS_AG;
 	targs.pag = pag;
+	targs.freeing_extent = flags & XFS_ALLOC_FLAG_FREEING;
 	error = xfs_alloc_read_agfl(mp, tp, targs.agno, &agflbp);
 	if (error)
 		goto out_agbp_relse;
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index a4427c5775c2..1e0fc28ef87a 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -78,6 +78,7 @@ typedef struct xfs_alloc_arg {
 #ifdef DEBUG
 	bool		alloc_minlen_only; /* allocate exact minlen extent */
 #endif
+	bool		freeing_extent;
 } xfs_alloc_arg_t;

 /*

With the above patch, xfs/538 causes the following call trace to be printed,

   XFS (vdc2): Internal error i != 1 at line 3426 of file fs/xfs/libxfs/xfs_btree.c.  Caller xfs_btree_insert+0x15c/0x1f0
   CPU: 2 PID: 1284 Comm: punch-alternati Not tainted 5.12.0-rc8-next-20210419-chandan #19
   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
   Call Trace:
    dump_stack+0x64/0x7c
    xfs_corruption_error+0x85/0x90
    ? xfs_btree_insert+0x15c/0x1f0
    xfs_btree_insert+0x18d/0x1f0
    ? xfs_btree_insert+0x15c/0x1f0
    ? xfs_allocbt_init_common+0x30/0xf0
    xfs_free_ag_extent+0x463/0x9d0
    __xfs_free_extent+0xe5/0x200
    xfs_trans_free_extent+0x3e/0x100
    xfs_extent_free_finish_item+0x24/0x40
    xfs_defer_finish_noroll+0x1f7/0x5c0
    __xfs_trans_commit+0x12f/0x300
    xfs_free_file_space+0x1af/0x2c0
    xfs_file_fallocate+0x1ca/0x430
    ? __cond_resched+0x16/0x40
    ? inode_security+0x22/0x60
    ? selinux_file_permission+0xe2/0x120
    vfs_fallocate+0x146/0x2e0
    __x64_sys_fallocate+0x3e/0x70
    do_syscall_64+0x40/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xae

The above call trace occurs during execution of the step #2 listed below,
1. Filling up 90% of the free space of the filesystem.
2. Punch alternate blocks of files.

Just before the failure, the filesystem had ~9000 busy extents. So I think we
have to flush busy extents even when refilling AGFL for the purpose of freeing
an extent.

>
> We have options here - once we get to the end of the btree and
> haven't found a candidate that isn't busy, we could fail
> immediately. Or maybe we try an optimisitic flush which forces the
> log and waits for as short while (instead of forever) for the
> generation to change and then fail if we get a timeout response. Or
> maybe there's a more elegant way of doing this that hasn't yet
> rattled out of my poor, overloaded brain right now.
>
> Just because we currently do a blocking flush doesn't mean we always
> must do a blocking flush....

I will try to work out a solution.

--
chandan

Chandan Babu R June 17, 2021, 4:48 a.m. UTC | #9

On 11 May 2021 at 17:19, Chandan Babu R wrote:
> On 06 May 2021 at 08:57, Dave Chinner wrote:
>> On Wed, May 05, 2021 at 06:12:41PM +0530, Chandan Babu R wrote:
>>> > Hence when doing allocation for the free list, we need to fail the
>>> > allocation rather than block on the only remaining free extent in
>>> > the AG. If we are freeing extents, the AGFL not being full is not an
>>> > issue at all. And if we are allocating extents, the transaction
>>> > reservations should have ensured that the AG had sufficient space in
>>> > it to complete the entire operation without deadlocking waiting for
>>> > space.
>>> >
>>> > Either way, I don't see a problem with making sure the AGFL
>>> > allocations just skip busy extents and fail if the only free extents
>>> > are ones this transaction has freed itself.
>>> >
>>>
>>> Hmm. In the scenario where *all* free extents in the AG were originally freed
>>> by the current transaction (and hence busy in the transaction),
>>
>> How does that happen?
>
> I tried in vain to arrive at the above mentioned scenario by consuming away as
> many blocks as possible from the filesystem. At best, I could arrive at an AG
> with just one free extent record in the cntbt (NOTE: I had to disable global
> reservation by invoking "xfs_io -x -c 'resblks 0' $mntpnt"):
>
> recs[1] = [startblock,blockcount]
> 1:[32767,1]
>
> For each AG available in an FS instance, we take away 8
> (i.e. XFS_ALLOC_AGFL_RESERVE + 4) blocks from the global free data blocks
> counter. This reservation is applied to the FS as a whole rather than each AG
> individually. Hence we could get to a scenario where an AG could have less
> than 8 free blocks. I could not find any other restriction in the code that
> explicitly prevents an AG from having zero free extents.
>
> However, I could not create such an AG because any fs operation that needs
> extent allocation to be done would try to reserve more than 1 extent causing
> the above cited AG to not be chosen.
>
>>
>>> we would need
>>> to be able to recognize this situation and skip invoking
>>> xfs_extent_busy_flush() altogether.
>>
>> If we are freeing extents (i.e XFS_ALLOC_FLAG_FREEING is set) and
>> we are doing allocation for AGFL and we only found busy extents,
>> then it's OK to fail the allocation.
>
> When freeing an extent, the following patch skips allocation of blocks to AGFL
> if all the free extents found are busy,
>
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index aaa19101bb2a..5310e311d5c6 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -1694,6 +1694,7 @@ xfs_alloc_ag_vextent_size(
>  	 * are no smaller extents available.
>  	 */
>  	if (!i) {
> +alloc_small_extent:
>  		error = xfs_alloc_ag_vextent_small(args, cnt_cur,
>  						   &fbno, &flen, &i);
>  		if (error)
> @@ -1710,6 +1711,9 @@ xfs_alloc_ag_vextent_size(
>  		/*
>  		 * Search for a non-busy extent that is large enough.
>  		 */
> +		xfs_agblock_t	orig_fbno = NULLAGBLOCK;
> +		xfs_extlen_t	orig_flen;
> +
>  		for (;;) {
>  			error = xfs_alloc_get_rec(cnt_cur, &fbno, &flen, &i);
>  			if (error)
> @@ -1719,6 +1723,11 @@ xfs_alloc_ag_vextent_size(
>  				goto error0;
>  			}
>
> +			if (orig_fbno == NULLAGBLOCK) {
> +				orig_fbno = fbno;
> +				orig_flen = flen;
> +			}
> +
>  			busy = xfs_alloc_compute_aligned(args, fbno, flen,
>  					&rbno, &rlen, &busy_gen);
>
> @@ -1729,6 +1738,13 @@ xfs_alloc_ag_vextent_size(
>  			if (error)
>  				goto error0;
>  			if (i == 0) {
> +				if (args->freeing_extent) {
> +					error = xfs_alloc_lookup_eq(cnt_cur,
> +							orig_fbno, orig_flen, &i);
> +					ASSERT(!error && i);
> +					goto alloc_small_extent;
> +				}
> +
>  				/*
>  				 * Our only valid extents must have been busy.
>  				 * Make it unbusy by forcing the log out and
> @@ -1819,7 +1835,7 @@ xfs_alloc_ag_vextent_size(
>  	 */
>  	args->len = rlen;
>  	if (rlen < args->minlen) {
> -		if (busy) {
> +		if (busy && !args->freeing_extent) {
>  			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
>  			trace_xfs_alloc_size_busy(args);
>  			xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
> @@ -2641,6 +2657,7 @@ xfs_alloc_fix_freelist(
>  	targs.alignment = targs.minlen = targs.prod = 1;
>  	targs.type = XFS_ALLOCTYPE_THIS_AG;
>  	targs.pag = pag;
> +	targs.freeing_extent = flags & XFS_ALLOC_FLAG_FREEING;
>  	error = xfs_alloc_read_agfl(mp, tp, targs.agno, &agflbp);
>  	if (error)
>  		goto out_agbp_relse;
> diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> index a4427c5775c2..1e0fc28ef87a 100644
> --- a/fs/xfs/libxfs/xfs_alloc.h
> +++ b/fs/xfs/libxfs/xfs_alloc.h
> @@ -78,6 +78,7 @@ typedef struct xfs_alloc_arg {
>  #ifdef DEBUG
>  	bool		alloc_minlen_only; /* allocate exact minlen extent */
>  #endif
> +	bool		freeing_extent;
>  } xfs_alloc_arg_t;
>
>  /*
>
> With the above patch, xfs/538 cause the following call trace to be printed,
>
>    XFS (vdc2): Internal error i != 1 at line 3426 of file fs/xfs/libxfs/xfs_btree.c.  Caller xfs_btree_insert+0x15c/0x1f0
>    CPU: 2 PID: 1284 Comm: punch-alternati Not tainted 5.12.0-rc8-next-20210419-chandan #19
>    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
>    Call Trace:
>     dump_stack+0x64/0x7c
>     xfs_corruption_error+0x85/0x90
>     ? xfs_btree_insert+0x15c/0x1f0
>     xfs_btree_insert+0x18d/0x1f0
>     ? xfs_btree_insert+0x15c/0x1f0
>     ? xfs_allocbt_init_common+0x30/0xf0
>     xfs_free_ag_extent+0x463/0x9d0
>     __xfs_free_extent+0xe5/0x200
>     xfs_trans_free_extent+0x3e/0x100
>     xfs_extent_free_finish_item+0x24/0x40
>     xfs_defer_finish_noroll+0x1f7/0x5c0
>     __xfs_trans_commit+0x12f/0x300
>     xfs_free_file_space+0x1af/0x2c0
>     xfs_file_fallocate+0x1ca/0x430
>     ? __cond_resched+0x16/0x40
>     ? inode_security+0x22/0x60
>     ? selinux_file_permission+0xe2/0x120
>     vfs_fallocate+0x146/0x2e0
>     __x64_sys_fallocate+0x3e/0x70
>     do_syscall_64+0x40/0x80
>     entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> The above call trace occurs during execution of the step #2 listed below,
> 1. Filling up 90% of the free space of the filesystem.
> 2. Punch alternate blocks of files.
>
> Just before the failure, the filesystem had ~9000 busy extents. So I think we
> have to flush busy extents even when refilling AGFL for the purpose of freeing
> an extent.
>
>>
>> We have options here - once we get to the end of the btree and
>> haven't found a candidate that isn't busy, we could fail
>> immediately. Or maybe we try an optimisitic flush which forces the
>> log and waits for as short while (instead of forever) for the
>> generation to change and then fail if we get a timeout response. Or
>> maybe there's a more elegant way of doing this that hasn't yet
>> rattled out of my poor, overloaded brain right now.
>>
>> Just because we currently do a blocking flush doesn't mean we always
>> must do a blocking flush....
>
> I will try to work out a solution.

I believe the following should be taken into consideration to design an
"optimistic flush delay" based solution,
1. Time consumed to perform a discard operation on a filesystem's block.
2. The size of extents that are being discarded.
3. Number of discard operation requests contained in a bio.

AFAICT, the combinations resulting from the above make it impossible to
calculate a time delay during which a sufficient number of busy extents are
guaranteed to have been freed so as to fill up the AGFL to the required
levels. In other words, a sufficient number of busy extents may not have been
discarded even after the optimistic delay interval elapses.

The other solution that I had thought about was to introduce a new flag for
the second argument of xfs_log_force(). The new flag will cause
xlog_state_do_iclog_callbacks() to wait on completion of all of the CIL ctxs
associated with the iclog that xfs_log_force() would be waiting on. Hence, a
call to xfs_log_force(mp, NEW_SYNC_FLAG) will return only after all the busy
extents associated with the iclog are discarded.

However, this method is also flawed as described below.

----------------------------------------------------------
 Task A                        Task B
----------------------------------------------------------
 Submit a filled up iclog
 for write operation
 (Assume that the iclog
 has non-zero number of CIL
 ctxs associated with it).
 On completion of iclog write
 operation, discard requests
 for busy extents are issued.

 Write log records (including
 commit record) into another
 iclog.

                               A task which is trying
                               to fill AGFL will now
                               invoke xfs_log_force()
                               with the new sync
                               flag.
                               Submit the 2nd iclog which
                               was partially filled by
                               Task A.
                               If there are no
                               discard requests
                               associated with this iclog,
                               xfs_log_force() will
                               return. As the discard
                               requests associated with
                               the first iclog are yet
                               to be completed,
                               we end up incorrectly
                               concluding that
                               all busy extents
                               have been processed.
----------------------------------------------------------

The inconsistency indicated above could also occur when discard requests
issued against second iclog get processed before discard requests associated
with the first iclog.
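
The flaw in the timeline above can be reduced to a tiny model. This is a
sketch with hypothetical names (struct model_iclog and both helpers), and it
assumes the new flag would only wait on discards attached to the iclog that
is being forced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical iclog model: only outstanding discard requests matter. */
struct model_iclog {
	unsigned int	pending_discards;
};

/* What the proposed flag would observe: only the iclog being forced. */
static bool
model_force_sees_idle(const struct model_iclog *logs, size_t forced)
{
	assert(logs != NULL);
	return logs[forced].pending_discards == 0;
}

/* What the AGFL refill needs: no discards outstanding on any iclog. */
static bool
model_all_discards_done(const struct model_iclog *logs, size_t nr)
{
	size_t	i;

	assert(logs != NULL);

	for (i = 0; i < nr; i++) {
		if (logs[i].pending_discards != 0)
			return false;
	}
	return true;
}
```

If the first iclog still has discards in flight and the second has none,
model_force_sees_idle() on the second iclog returns true while
model_all_discards_done() returns false, which is the incorrect conclusion
shown in the table.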

The XFS_EXTENT_BUSY_IN_TRANS flag based solution is the only method that I can
think of that solves this problem correctly. However, I do agree with your
earlier observation that we should not flush busy extents unless we have
checked for presence of free extents in the btree records present on the left
side of the btree cursor.
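
The search order agreed on above (look right for a large non-busy extent,
then backtrack left through smaller ones, and only then consider a flush) can
be sketched in userspace as follows. The record array stands in for the cntbt
sorted by length, and struct model_rec, MODEL_NEED_FLUSH and
model_pick_extent() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cntbt record: length plus whether it is busy. */
struct model_rec {
	unsigned int	len;
	bool		busy;
};

#define MODEL_NEED_FLUSH	((size_t)-1)	/* no usable extent was found */

/*
 * @recs is sorted by ascending length, like the cntbt. Returns the index
 * of the record to allocate from, or MODEL_NEED_FLUSH if every candidate
 * of at least @minlen blocks is busy.
 */
static size_t
model_pick_extent(const struct model_rec *recs, size_t nr,
		  unsigned int minlen, unsigned int maxlen)
{
	size_t	start = 0;
	size_t	i;

	assert(recs != NULL || nr == 0);
	assert(minlen <= maxlen);

	/* Position at the first record that is at least maxlen long. */
	while (start < nr && recs[start].len < maxlen)
		start++;

	/* Walk right: any non-busy record here satisfies maxlen. */
	for (i = start; i < nr; i++) {
		if (!recs[i].busy)
			return i;
	}

	/* Backtrack: walk left through smaller records, largest first. */
	for (i = start; i-- > 0; ) {
		if (recs[i].len < minlen)
			break;
		if (!recs[i].busy)
			return i;
	}
	return MODEL_NEED_FLUSH;
}
```

Only the MODEL_NEED_FLUSH result would lead to a flush of busy extents; a
busy first candidate on its own does not.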

--
chandan
