diff mbox series

[1/2] xfs: flush inodegc workqueue tasks before cancel

Message ID 20220113133701.629593-2-bfoster@redhat.com (mailing list archive)
State Accepted, archived
Headers show
Series xfs: a couple misc/small deferred inactivation tweaks | expand

Commit Message

Brian Foster Jan. 13, 2022, 1:37 p.m. UTC
The xfs_inodegc_stop() helper performs a high level flush of pending
work on the percpu queues and then runs a cancel_work_sync() on each
of the percpu work tasks to ensure all work has completed before
returning.  While cancel_work_sync() waits for wq tasks to complete,
it does not guarantee work tasks have started. This means that the
_stop() helper can queue and instantly cancel a wq task without
having completed the associated work. This can be observed by
tracepoint inspection of a simple "rm -f <file>; fsfreeze -f <mnt>"
test:

	xfs_destroy_inode: ... ino 0x83 ...
	xfs_inode_set_need_inactive: ... ino 0x83 ...
	xfs_inodegc_stop: ...
	...
	xfs_inodegc_start: ...
	xfs_inodegc_worker: ...
	xfs_inode_inactivating: ... ino 0x83 ...

The first few lines show that the inode is removed and need inactive
state set, but the inactivation work has not completed before the
inodegc mechanism stops. The inactivation doesn't actually occur
until the fs is unfrozen and the gc mechanism starts back up. Note
that this test requires fsfreeze to reproduce because xfs_freeze
indirectly invokes xfs_fs_statfs(), which calls xfs_inodegc_flush().

When this occurs, the workqueue try_to_grab_pending() logic first
tries to steal the pending bit, which does not succeed because the
bit has been set by queue_work_on(). Subsequently, it checks for
association of a pool workqueue from the work item under the pool
lock. This association is set at the point a work item is queued and
cleared when dequeued for processing. If the association exists, the
work item is removed from the queue and cancel_work_sync() returns
true. If the pwq association is cleared, the remove attempt assumes
the task is busy and retries (eventually returning false to the
caller after waiting for the work task to complete).

To avoid this race, we can flush each work item explicitly before
cancel. However, since the _queue_all() already schedules each
underlying work item, the workqueue level helpers are sufficient to
achieve the same ordering effect. E.g., the inodegc enabled flag
prevents scheduling any further work in the _stop() case. Use the
drain_workqueue() helper in this particular case to make the intent
a bit more self explanatory.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_icache.c | 22 ++++------------------
 1 file changed, 4 insertions(+), 18 deletions(-)

Comments

Darrick J. Wong Jan. 13, 2022, 6:35 p.m. UTC | #1
On Thu, Jan 13, 2022 at 08:37:00AM -0500, Brian Foster wrote:
> The xfs_inodegc_stop() helper performs a high level flush of pending
> work on the percpu queues and then runs a cancel_work_sync() on each
> of the percpu work tasks to ensure all work has completed before
> returning.  While cancel_work_sync() waits for wq tasks to complete,
> it does not guarantee work tasks have started. This means that the
> _stop() helper can queue and instantly cancel a wq task without
> having completed the associated work. This can be observed by
> tracepoint inspection of a simple "rm -f <file>; fsfreeze -f <mnt>"
> test:
> 
> 	xfs_destroy_inode: ... ino 0x83 ...
> 	xfs_inode_set_need_inactive: ... ino 0x83 ...
> 	xfs_inodegc_stop: ...
> 	...
> 	xfs_inodegc_start: ...
> 	xfs_inodegc_worker: ...
> 	xfs_inode_inactivating: ... ino 0x83 ...
> 
> The first few lines show that the inode is removed and need inactive
> state set, but the inactivation work has not completed before the
> inodegc mechanism stops. The inactivation doesn't actually occur
> until the fs is unfrozen and the gc mechanism starts back up. Note
> that this test requires fsfreeze to reproduce because xfs_freeze
> indirectly invokes xfs_fs_statfs(), which calls xfs_inodegc_flush().
> 
> When this occurs, the workqueue try_to_grab_pending() logic first
> tries to steal the pending bit, which does not succeed because the
> bit has been set by queue_work_on(). Subsequently, it checks for
> association of a pool workqueue from the work item under the pool
> lock. This association is set at the point a work item is queued and
> cleared when dequeued for processing. If the association exists, the
> work item is removed from the queue and cancel_work_sync() returns
> true. If the pwq association is cleared, the remove attempt assumes
> the task is busy and retries (eventually returning false to the
> caller after waiting for the work task to complete).
> 
> To avoid this race, we can flush each work item explicitly before
> cancel. However, since the _queue_all() already schedules each
> underlying work item, the workqueue level helpers are sufficient to
> achieve the same ordering effect. E.g., the inodegc enabled flag
> prevents scheduling any further work in the _stop() case. Use the
> drain_workqueue() helper in this particular case to make the intent
> a bit more self explanatory.

/me wonders why he didn't think of drain/flush_workqueue in the first
place.  Hm.  Well, the logic looks sound so,

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

and I'll go give this a spin.

--D

> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_icache.c | 22 ++++------------------
>  1 file changed, 4 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index d019c98eb839..7a2a5e2be3cf 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1852,28 +1852,20 @@ xfs_inodegc_worker(
>  }
>  
>  /*
> - * Force all currently queued inode inactivation work to run immediately, and
> - * wait for the work to finish. Two pass - queue all the work first pass, wait
> - * for it in a second pass.
> + * Force all currently queued inode inactivation work to run immediately and
> + * wait for the work to finish.
>   */
>  void
>  xfs_inodegc_flush(
>  	struct xfs_mount	*mp)
>  {
> -	struct xfs_inodegc	*gc;
> -	int			cpu;

>  	if (!xfs_is_inodegc_enabled(mp))
>  		return;
>  
>  	trace_xfs_inodegc_flush(mp, __return_address);
>  
>  	xfs_inodegc_queue_all(mp);
> -
> -	for_each_online_cpu(cpu) {
> -		gc = per_cpu_ptr(mp->m_inodegc, cpu);
> -		flush_work(&gc->work);
> -	}
> +	flush_workqueue(mp->m_inodegc_wq);
>  }
>  
>  /*
> @@ -1884,18 +1876,12 @@ void
>  xfs_inodegc_stop(
>  	struct xfs_mount	*mp)
>  {
> -	struct xfs_inodegc	*gc;
> -	int			cpu;
> -
>  	if (!xfs_clear_inodegc_enabled(mp))
>  		return;
>  
>  	xfs_inodegc_queue_all(mp);
> +	drain_workqueue(mp->m_inodegc_wq);
>  
> -	for_each_online_cpu(cpu) {
> -		gc = per_cpu_ptr(mp->m_inodegc, cpu);
> -		cancel_work_sync(&gc->work);
> -	}
>  	trace_xfs_inodegc_stop(mp, __return_address);
>  }
>  
> -- 
> 2.31.1
>
Dave Chinner Jan. 13, 2022, 10:19 p.m. UTC | #2
On Thu, Jan 13, 2022 at 08:37:00AM -0500, Brian Foster wrote:
> The xfs_inodegc_stop() helper performs a high level flush of pending
> work on the percpu queues and then runs a cancel_work_sync() on each
> of the percpu work tasks to ensure all work has completed before
> returning.  While cancel_work_sync() waits for wq tasks to complete,
> it does not guarantee work tasks have started. This means that the
> _stop() helper can queue and instantly cancel a wq task without
> having completed the associated work. This can be observed by
> tracepoint inspection of a simple "rm -f <file>; fsfreeze -f <mnt>"
> test:
> 
> 	xfs_destroy_inode: ... ino 0x83 ...
> 	xfs_inode_set_need_inactive: ... ino 0x83 ...
> 	xfs_inodegc_stop: ...
> 	...
> 	xfs_inodegc_start: ...
> 	xfs_inodegc_worker: ...
> 	xfs_inode_inactivating: ... ino 0x83 ...
> 
> The first few lines show that the inode is removed and need inactive
> state set, but the inactivation work has not completed before the
> inodegc mechanism stops. The inactivation doesn't actually occur
> until the fs is unfrozen and the gc mechanism starts back up. Note
> that this test requires fsfreeze to reproduce because xfs_freeze
> indirectly invokes xfs_fs_statfs(), which calls xfs_inodegc_flush().
> 
> When this occurs, the workqueue try_to_grab_pending() logic first
> tries to steal the pending bit, which does not succeed because the
> bit has been set by queue_work_on(). Subsequently, it checks for
> association of a pool workqueue from the work item under the pool
> lock. This association is set at the point a work item is queued and
> cleared when dequeued for processing. If the association exists, the
> work item is removed from the queue and cancel_work_sync() returns
> true. If the pwq association is cleared, the remove attempt assumes
> the task is busy and retries (eventually returning false to the
> caller after waiting for the work task to complete).
> 
> To avoid this race, we can flush each work item explicitly before
> cancel. However, since the _queue_all() already schedules each
> underlying work item, the workqueue level helpers are sufficient to
> achieve the same ordering effect. E.g., the inodegc enabled flag
> prevents scheduling any further work in the _stop() case. Use the
> drain_workqueue() helper in this particular case to make the intent
> a bit more self explanatory.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_icache.c | 22 ++++------------------
>  1 file changed, 4 insertions(+), 18 deletions(-)

Looks good - flush/drain_workqueue() are much nicer was of doing
this.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
diff mbox series

Patch

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index d019c98eb839..7a2a5e2be3cf 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1852,28 +1852,20 @@  xfs_inodegc_worker(
 }
 
 /*
- * Force all currently queued inode inactivation work to run immediately, and
- * wait for the work to finish. Two pass - queue all the work first pass, wait
- * for it in a second pass.
+ * Force all currently queued inode inactivation work to run immediately and
+ * wait for the work to finish.
  */
 void
 xfs_inodegc_flush(
 	struct xfs_mount	*mp)
 {
-	struct xfs_inodegc	*gc;
-	int			cpu;
-
 	if (!xfs_is_inodegc_enabled(mp))
 		return;
 
 	trace_xfs_inodegc_flush(mp, __return_address);
 
 	xfs_inodegc_queue_all(mp);
-
-	for_each_online_cpu(cpu) {
-		gc = per_cpu_ptr(mp->m_inodegc, cpu);
-		flush_work(&gc->work);
-	}
+	flush_workqueue(mp->m_inodegc_wq);
 }
 
 /*
@@ -1884,18 +1876,12 @@  void
 xfs_inodegc_stop(
 	struct xfs_mount	*mp)
 {
-	struct xfs_inodegc	*gc;
-	int			cpu;
-
 	if (!xfs_clear_inodegc_enabled(mp))
 		return;
 
 	xfs_inodegc_queue_all(mp);
+	drain_workqueue(mp->m_inodegc_wq);
 
-	for_each_online_cpu(cpu) {
-		gc = per_cpu_ptr(mp->m_inodegc, cpu);
-		cancel_work_sync(&gc->work);
-	}
 	trace_xfs_inodegc_stop(mp, __return_address);
 }