[3/3] xfs: prevent mount and log shutdown race

Message ID	20241112221920.1105007-4-david@fromorbit.com (mailing list archive)
State	Accepted, archived
Headers	show Received: from mail-pf1-f177.google.com (mail-pf1-f177.google.com [209.85.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9809121745E for <linux-xfs@vger.kernel.org>; Tue, 12 Nov 2024 22:19:26 +0000 (UTC) From: Dave Chinner <david@fromorbit.com> To: linux-xfs@vger.kernel.org Cc: cem@kernel.org Subject: [PATCH 3/3] xfs: prevent mount and log shutdown race Date: Wed, 13 Nov 2024 09:05:16 +1100 Message-ID: <20241112221920.1105007-4-david@fromorbit.com> In-Reply-To: <20241112221920.1105007-1-david@fromorbit.com> References: <20241112221920.1105007-1-david@fromorbit.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	xfs: miscellaneous bug fixes \| expand [0/3] xfs: miscellaneous bug fixes [1/3] xfs: fix sparse inode limits on runt AG [2/3] xfs: delalloc and quota softlimit timers are incoherent [3/3] xfs: prevent mount and log shutdown race

Message ID

20241112221920.1105007-4-david@fromorbit.com (mailing list archive)

State

Accepted, archived

Headers

From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Cc: cem@kernel.org
Subject: [PATCH 3/3] xfs: prevent mount and log shutdown race
Date: Wed, 13 Nov 2024 09:05:16 +1100
Message-ID: <20241112221920.1105007-4-david@fromorbit.com>
In-Reply-To: <20241112221920.1105007-1-david@fromorbit.com>
References: <20241112221920.1105007-1-david@fromorbit.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Series

xfs: miscellaneous bug fixes | expand

Commit Message

Dave Chinner Nov. 12, 2024, 10:05 p.m. UTC

From: Dave Chinner <dchinner@redhat.com>

I recently had an fstests hang where there were two internal tasks
stuck like so:

[ 6559.010870] task:kworker/24:45   state:D stack:12152 pid:631308 tgid:631308 ppid:2      flags:0x00004000
[ 6559.016984] Workqueue: xfs-buf/dm-2 xfs_buf_ioend_work
[ 6559.020349] Call Trace:
[ 6559.022002]  <TASK>
[ 6559.023426]  __schedule+0x650/0xb10
[ 6559.025734]  schedule+0x6d/0xf0
[ 6559.027835]  schedule_timeout+0x31/0x180
[ 6559.030582]  wait_for_common+0x10c/0x1e0
[ 6559.033495]  wait_for_completion+0x1d/0x30
[ 6559.036463]  __flush_workqueue+0xeb/0x490
[ 6559.039479]  ? mempool_alloc_slab+0x15/0x20
[ 6559.042537]  xlog_cil_force_seq+0xa1/0x2f0
[ 6559.045498]  ? bio_alloc_bioset+0x1d8/0x510
[ 6559.048578]  ? submit_bio_noacct+0x2f2/0x380
[ 6559.051665]  ? xlog_force_shutdown+0x3b/0x170
[ 6559.054819]  xfs_log_force+0x77/0x230
[ 6559.057455]  xlog_force_shutdown+0x3b/0x170
[ 6559.060507]  xfs_do_force_shutdown+0xd4/0x200
[ 6559.063798]  ? xfs_buf_rele+0x1bd/0x580
[ 6559.066541]  xfs_buf_ioend_handle_error+0x163/0x2e0
[ 6559.070099]  xfs_buf_ioend+0x61/0x200
[ 6559.072728]  xfs_buf_ioend_work+0x15/0x20
[ 6559.075706]  process_scheduled_works+0x1d4/0x400
[ 6559.078814]  worker_thread+0x234/0x2e0
[ 6559.081300]  kthread+0x147/0x170
[ 6559.083462]  ? __pfx_worker_thread+0x10/0x10
[ 6559.086295]  ? __pfx_kthread+0x10/0x10
[ 6559.088771]  ret_from_fork+0x3e/0x50
[ 6559.091153]  ? __pfx_kthread+0x10/0x10
[ 6559.093624]  ret_from_fork_asm+0x1a/0x30
[ 6559.096227]  </TASK>

[ 6559.109304] Workqueue: xfs-cil/dm-2 xlog_cil_push_work
[ 6559.112673] Call Trace:
[ 6559.114333]  <TASK>
[ 6559.115760]  __schedule+0x650/0xb10
[ 6559.118084]  schedule+0x6d/0xf0
[ 6559.120175]  schedule_timeout+0x31/0x180
[ 6559.122776]  ? call_rcu+0xee/0x2f0
[ 6559.125034]  __down_common+0xbe/0x1f0
[ 6559.127470]  __down+0x1d/0x30
[ 6559.129458]  down+0x48/0x50
[ 6559.131343]  ? xfs_buf_item_unpin+0x8d/0x380
[ 6559.134213]  xfs_buf_lock+0x3d/0xe0
[ 6559.136544]  xfs_buf_item_unpin+0x8d/0x380
[ 6559.139253]  xlog_cil_committed+0x287/0x520
[ 6559.142019]  ? sched_clock+0x10/0x30
[ 6559.144384]  ? sched_clock_cpu+0x10/0x190
[ 6559.147039]  ? psi_group_change+0x48/0x310
[ 6559.149735]  ? _raw_spin_unlock+0xe/0x30
[ 6559.152340]  ? finish_task_switch+0xbc/0x310
[ 6559.155163]  xlog_cil_process_committed+0x6d/0x90
[ 6559.158265]  xlog_state_shutdown_callbacks+0x53/0x110
[ 6559.161564]  ? xlog_cil_push_work+0xa70/0xaf0
[ 6559.164441]  xlog_state_release_iclog+0xba/0x1b0
[ 6559.167483]  xlog_cil_push_work+0xa70/0xaf0
[ 6559.170260]  process_scheduled_works+0x1d4/0x400
[ 6559.173286]  worker_thread+0x234/0x2e0
[ 6559.175779]  kthread+0x147/0x170
[ 6559.177933]  ? __pfx_worker_thread+0x10/0x10
[ 6559.180748]  ? __pfx_kthread+0x10/0x10
[ 6559.183231]  ret_from_fork+0x3e/0x50
[ 6559.185601]  ? __pfx_kthread+0x10/0x10
[ 6559.188092]  ret_from_fork_asm+0x1a/0x30
[ 6559.190692]  </TASK>

This is an ABBA deadlock where buffer IO completion is triggering a
forced shutdown with the buffer lock held. It is waiting for the CIL
to flush as part of the log force. The CIL flush is blocked doing
shutdown processing of all it's objects, trying to unpin a buffer
item. That requires taking the buffer lock....

For the CIL to be doing shutdown processing, the log must be marked
with XLOG_IO_ERROR, but that doesn't happen until after the log
force is issued. Hence for xfs_do_force_shutdown() to be forcing
the log on a shut down log, we must have had a racing
xlog_force_shutdown and xfs_force_shutdown like so:

p0			p1			CIL push

   			<holds buffer lock>
xlog_force_shutdown
  xfs_log_force
   test_and_set_bit(XLOG_IO_ERROR)
   						xlog_state_release_iclog()
						  sees XLOG_IO_ERROR
						  xlog_state_shutdown_callbacks
						    ....
						    xfs_buf_item_unpin
						    xfs_buf_lock
						    <blocks on buffer p1 holds>

   			xfs_force_shutdown
			  xfs_set_shutdown(mp) wins
			    xlog_force_shutdown
			      xfs_log_force
			        <blocks on CIL push>

  xfs_set_shutdown(mp) fails
  <shuts down rest of log>

The deadlock can be mitigated by avoiding the log force on the
second pass through xlog_force_shutdown. Do this by adding another
atomic state bit (XLOG_OP_PENDING_SHUTDOWN) that is set on entry to
xlog_force_shutdown() but doesn't mark the log as shutdown.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c      | 11 +++++++++++
 fs/xfs/xfs_log_priv.h |  1 +
 2 files changed, 12 insertions(+)

Comments

Darrick J. Wong Nov. 12, 2024, 11:58 p.m. UTC | #1

On Wed, Nov 13, 2024 at 09:05:16AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> I recently had an fstests hang where there were two internal tasks
> stuck like so:
> 
> [ 6559.010870] task:kworker/24:45   state:D stack:12152 pid:631308 tgid:631308 ppid:2      flags:0x00004000
> [ 6559.016984] Workqueue: xfs-buf/dm-2 xfs_buf_ioend_work
> [ 6559.020349] Call Trace:
> [ 6559.022002]  <TASK>
> [ 6559.023426]  __schedule+0x650/0xb10
> [ 6559.025734]  schedule+0x6d/0xf0
> [ 6559.027835]  schedule_timeout+0x31/0x180
> [ 6559.030582]  wait_for_common+0x10c/0x1e0
> [ 6559.033495]  wait_for_completion+0x1d/0x30
> [ 6559.036463]  __flush_workqueue+0xeb/0x490
> [ 6559.039479]  ? mempool_alloc_slab+0x15/0x20
> [ 6559.042537]  xlog_cil_force_seq+0xa1/0x2f0
> [ 6559.045498]  ? bio_alloc_bioset+0x1d8/0x510
> [ 6559.048578]  ? submit_bio_noacct+0x2f2/0x380
> [ 6559.051665]  ? xlog_force_shutdown+0x3b/0x170
> [ 6559.054819]  xfs_log_force+0x77/0x230
> [ 6559.057455]  xlog_force_shutdown+0x3b/0x170
> [ 6559.060507]  xfs_do_force_shutdown+0xd4/0x200
> [ 6559.063798]  ? xfs_buf_rele+0x1bd/0x580
> [ 6559.066541]  xfs_buf_ioend_handle_error+0x163/0x2e0
> [ 6559.070099]  xfs_buf_ioend+0x61/0x200
> [ 6559.072728]  xfs_buf_ioend_work+0x15/0x20
> [ 6559.075706]  process_scheduled_works+0x1d4/0x400
> [ 6559.078814]  worker_thread+0x234/0x2e0
> [ 6559.081300]  kthread+0x147/0x170
> [ 6559.083462]  ? __pfx_worker_thread+0x10/0x10
> [ 6559.086295]  ? __pfx_kthread+0x10/0x10
> [ 6559.088771]  ret_from_fork+0x3e/0x50
> [ 6559.091153]  ? __pfx_kthread+0x10/0x10
> [ 6559.093624]  ret_from_fork_asm+0x1a/0x30
> [ 6559.096227]  </TASK>
> 
> [ 6559.109304] Workqueue: xfs-cil/dm-2 xlog_cil_push_work
> [ 6559.112673] Call Trace:
> [ 6559.114333]  <TASK>
> [ 6559.115760]  __schedule+0x650/0xb10
> [ 6559.118084]  schedule+0x6d/0xf0
> [ 6559.120175]  schedule_timeout+0x31/0x180
> [ 6559.122776]  ? call_rcu+0xee/0x2f0
> [ 6559.125034]  __down_common+0xbe/0x1f0
> [ 6559.127470]  __down+0x1d/0x30
> [ 6559.129458]  down+0x48/0x50
> [ 6559.131343]  ? xfs_buf_item_unpin+0x8d/0x380
> [ 6559.134213]  xfs_buf_lock+0x3d/0xe0
> [ 6559.136544]  xfs_buf_item_unpin+0x8d/0x380
> [ 6559.139253]  xlog_cil_committed+0x287/0x520
> [ 6559.142019]  ? sched_clock+0x10/0x30
> [ 6559.144384]  ? sched_clock_cpu+0x10/0x190
> [ 6559.147039]  ? psi_group_change+0x48/0x310
> [ 6559.149735]  ? _raw_spin_unlock+0xe/0x30
> [ 6559.152340]  ? finish_task_switch+0xbc/0x310
> [ 6559.155163]  xlog_cil_process_committed+0x6d/0x90
> [ 6559.158265]  xlog_state_shutdown_callbacks+0x53/0x110
> [ 6559.161564]  ? xlog_cil_push_work+0xa70/0xaf0
> [ 6559.164441]  xlog_state_release_iclog+0xba/0x1b0
> [ 6559.167483]  xlog_cil_push_work+0xa70/0xaf0
> [ 6559.170260]  process_scheduled_works+0x1d4/0x400
> [ 6559.173286]  worker_thread+0x234/0x2e0
> [ 6559.175779]  kthread+0x147/0x170
> [ 6559.177933]  ? __pfx_worker_thread+0x10/0x10
> [ 6559.180748]  ? __pfx_kthread+0x10/0x10
> [ 6559.183231]  ret_from_fork+0x3e/0x50
> [ 6559.185601]  ? __pfx_kthread+0x10/0x10
> [ 6559.188092]  ret_from_fork_asm+0x1a/0x30
> [ 6559.190692]  </TASK>
> 
> This is an ABBA deadlock where buffer IO completion is triggering a
> forced shutdown with the buffer lock held. It is waiting for the CIL
> to flush as part of the log force. The CIL flush is blocked doing
> shutdown processing of all it's objects, trying to unpin a buffer
> item. That requires taking the buffer lock....
> 
> For the CIL to be doing shutdown processing, the log must be marked
> with XLOG_IO_ERROR, but that doesn't happen until after the log
> force is issued. Hence for xfs_do_force_shutdown() to be forcing
> the log on a shut down log, we must have had a racing
> xlog_force_shutdown and xfs_force_shutdown like so:
> 
> p0			p1			CIL push
> 
>    			<holds buffer lock>
> xlog_force_shutdown
>   xfs_log_force
>    test_and_set_bit(XLOG_IO_ERROR)
>    						xlog_state_release_iclog()
> 						  sees XLOG_IO_ERROR
> 						  xlog_state_shutdown_callbacks
> 						    ....
> 						    xfs_buf_item_unpin
> 						    xfs_buf_lock
> 						    <blocks on buffer p1 holds>
> 
>    			xfs_force_shutdown
> 			  xfs_set_shutdown(mp) wins
> 			    xlog_force_shutdown
> 			      xfs_log_force
> 			        <blocks on CIL push>
> 
>   xfs_set_shutdown(mp) fails
>   <shuts down rest of log>

Huh.  I wonder, what happens today if there are multiple threads all
trying to shut down the log?  Same thing?  I guess we've never really
gone Farmer Brown's Bad Day[1] on this part of the fs.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

[1] https://www.gocomics.com/calvinandhobbes/1993/04/11

> The deadlock can be mitigated by avoiding the log force on the
> second pass through xlog_force_shutdown. Do this by adding another
> atomic state bit (XLOG_OP_PENDING_SHUTDOWN) that is set on entry to
> xlog_force_shutdown() but doesn't mark the log as shutdown.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c      | 11 +++++++++++
>  fs/xfs/xfs_log_priv.h |  1 +
>  2 files changed, 12 insertions(+)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 26b2f5887b88..05daad8a8d34 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -3455,6 +3455,16 @@ xlog_force_shutdown(
>  	if (!log)
>  		return false;
>  
> +	/*
> +	 * Ensure that there is only ever one log shutdown being processed.
> +	 * If we allow the log force below on a second pass after shutting
> +	 * down the log, we risk deadlocking the CIL push as it may require
> +	 * locks on objects the current shutdown context holds (e.g. taking
> +	 * buffer locks to abort buffers on last unpin of buf log items).
> +	 */
> +	if (test_and_set_bit(XLOG_SHUTDOWN_STARTED, &log->l_opstate))
> +		return false;
> +
>  	/*
>  	 * Flush all the completed transactions to disk before marking the log
>  	 * being shut down. We need to do this first as shutting down the log
> @@ -3487,6 +3497,7 @@ xlog_force_shutdown(
>  	spin_lock(&log->l_icloglock);
>  	if (test_and_set_bit(XLOG_IO_ERROR, &log->l_opstate)) {
>  		spin_unlock(&log->l_icloglock);
> +		ASSERT(0);
>  		return false;
>  	}
>  	spin_unlock(&log->l_icloglock);
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index b8778a4fd6b6..f3d78869e5e5 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -458,6 +458,7 @@ struct xlog {
>  #define XLOG_IO_ERROR		2	/* log hit an I/O error, and being
>  				   shutdown */
>  #define XLOG_TAIL_WARN		3	/* log tail verify warning issued */
> +#define XLOG_SHUTDOWN_STARTED	4	/* xlog_force_shutdown() exclusion */
>  
>  static inline bool
>  xlog_recovery_needed(struct xlog *log)
> -- 
> 2.45.2
> 
>

Dave Chinner Nov. 13, 2024, 12:56 a.m. UTC | #2

On Tue, Nov 12, 2024 at 03:58:08PM -0800, Darrick J. Wong wrote:
> On Wed, Nov 13, 2024 at 09:05:16AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > I recently had an fstests hang where there were two internal tasks
> > stuck like so:
....
> > For the CIL to be doing shutdown processing, the log must be marked
> > with XLOG_IO_ERROR, but that doesn't happen until after the log
> > force is issued. Hence for xfs_do_force_shutdown() to be forcing
> > the log on a shut down log, we must have had a racing
> > xlog_force_shutdown and xfs_force_shutdown like so:
> > 
> > p0			p1			CIL push
> > 
> >    			<holds buffer lock>
> > xlog_force_shutdown
> >   xfs_log_force
> >    test_and_set_bit(XLOG_IO_ERROR)
> >    						xlog_state_release_iclog()
> > 						  sees XLOG_IO_ERROR
> > 						  xlog_state_shutdown_callbacks
> > 						    ....
> > 						    xfs_buf_item_unpin
> > 						    xfs_buf_lock
> > 						    <blocks on buffer p1 holds>
> > 
> >    			xfs_force_shutdown
> > 			  xfs_set_shutdown(mp) wins
> > 			    xlog_force_shutdown
> > 			      xfs_log_force
> > 			        <blocks on CIL push>
> > 
> >   xfs_set_shutdown(mp) fails
> >   <shuts down rest of log>
> 
> Huh.  I wonder, what happens today if there are multiple threads all
> trying to shut down the log?  Same thing?

Yes. Anywhere that a both a log shutdown and a mount shutdown can be
called concurrently and one of them holds a locked buffer that is
also dirty in the CIL can trip over this. When I first saw it I
thought "calling shutdown with a locked buffer is bad", then
realised that we do that -everywhere- and assume it is safe to do
so. That's when I started looking deeper and found this....

> I guess we've never really
> gone Farmer Brown's Bad Day[1] on this part of the fs.

Oh, running ~64 individual fstests concurrently on the same VM does
a good imitation of that.

$ time sudo ./check-parallel /mnt/xfs -s xfs -x dump
Tests run: 1143
Failure count: 11

real    9m12.307s
user    0m0.007s
sys     0m0.013s
$

That's what's finding these little weird timing-related issues. I've got
several other repeating issues that I haven't got to the bottom of
yet, so Farmer Brown's Bad Day is not over yet...

-Dave.

Christoph Hellwig Nov. 13, 2024, 8:50 a.m. UTC | #3

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 26b2f5887b88..05daad8a8d34 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -3455,6 +3455,16 @@  xlog_force_shutdown(
 	if (!log)
 		return false;
 
+	/*
+	 * Ensure that there is only ever one log shutdown being processed.
+	 * If we allow the log force below on a second pass after shutting
+	 * down the log, we risk deadlocking the CIL push as it may require
+	 * locks on objects the current shutdown context holds (e.g. taking
+	 * buffer locks to abort buffers on last unpin of buf log items).
+	 */
+	if (test_and_set_bit(XLOG_SHUTDOWN_STARTED, &log->l_opstate))
+		return false;
+
 	/*
 	 * Flush all the completed transactions to disk before marking the log
 	 * being shut down. We need to do this first as shutting down the log
@@ -3487,6 +3497,7 @@  xlog_force_shutdown(
 	spin_lock(&log->l_icloglock);
 	if (test_and_set_bit(XLOG_IO_ERROR, &log->l_opstate)) {
 		spin_unlock(&log->l_icloglock);
+		ASSERT(0);
 		return false;
 	}
 	spin_unlock(&log->l_icloglock);
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index b8778a4fd6b6..f3d78869e5e5 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -458,6 +458,7 @@  struct xlog {
 #define XLOG_IO_ERROR		2	/* log hit an I/O error, and being
 				   shutdown */
 #define XLOG_TAIL_WARN		3	/* log tail verify warning issued */
+#define XLOG_SHUTDOWN_STARTED	4	/* xlog_force_shutdown() exclusion */
 
 static inline bool
 xlog_recovery_needed(struct xlog *log)

[3/3] xfs: prevent mount and log shutdown race

Commit Message

Comments

Patch