
xfs: fix race condition in inodegc list and cpumask handling

Message ID 20241125015258.2652325-1-leo.lilong@huawei.com (mailing list archive)
State New

Commit Message

Long Li Nov. 25, 2024, 1:52 a.m. UTC
There is a race condition between the inodegc queue and the inodegc worker
in which the cpumask bit can be left clear even though inodes remain on the
per-cpu gc list.

Current problematic sequence:

  CPU0                             CPU1
  --------------------             ---------------------
  xfs_inodegc_queue()              xfs_inodegc_worker()
                                     llist_del_all(&gc->list)
    llist_add(&ip->i_gclist, &gc->list)
    cpumask_test_and_set_cpu()
                                     cpumask_clear_cpu()
                  < cpumask not set >
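
Condensed from the pre-patch code paths (paraphrasing the hunks below, not a
complete listing), the per-function ordering that permits this interleaving
is:

	/* xfs_inodegc_worker(), before this patch (condensed) */
	struct llist_node *node = llist_del_all(&gc->list);
	...
	cpumask_clear_cpu(gc->cpu, &mp->m_inodegc_cpumask);
	smp_mb__after_atomic();

	/* xfs_inodegc_queue(), before this patch (condensed) */
	llist_add(&ip->i_gclist, &gc->list);
	...
	smp_mb__before_atomic();
	if (!cpumask_test_cpu(cpu_nr, &mp->m_inodegc_cpumask))
		cpumask_test_and_set_cpu(cpu_nr, &mp->m_inodegc_cpumask);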

Fix this by moving llist_del_all() to after cpumask_clear_cpu() in the
worker to ensure proper ordering. With this ordering, a concurrent queue
operation either adds its inode before the worker's llist_del_all() (so the
entry is picked up in this pass) or sets the cpumask bit after the worker
has cleared it (so the bit remains set for the new entry).

Also remove the now-unnecessary smp_mb__{before,after}_atomic() barriers,
since the llist_* operations already provide the required ordering
semantics. This makes the code cleaner; a condensed sketch of those
primitives is included after the diffstat.

Fixes: 62334fab4762 ("xfs: use per-mount cpumask to track nonempty percpu inodegc lists")
Reported-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
 fs/xfs/xfs_icache.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)
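
For reference, the ordering claim above rests on the llist primitives being
full-barrier atomic RMW operations. A condensed sketch (paraphrasing
include/linux/llist.h and lib/llist.c; the in-tree versions differ slightly):

	/* llist_add() boils down to a cmpxchg loop on head->first. */
	static inline bool llist_add(struct llist_node *new,
				     struct llist_head *head)
	{
		struct llist_node *first = READ_ONCE(head->first);

		do {
			new->next = first;
			/* a successful cmpxchg is a full memory barrier */
		} while (!try_cmpxchg(&head->first, &first, new));

		return !first;		/* list was previously empty */
	}

	/* llist_del_all() is a single xchg, also a full memory barrier. */
	static inline struct llist_node *llist_del_all(struct llist_head *head)
	{
		return xchg(&head->first, NULL);
	}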

Comments

Dave Chinner Nov. 25, 2024, 9:03 p.m. UTC | #1
On Mon, Nov 25, 2024 at 09:52:58AM +0800, Long Li wrote:
> There is a race condition between the inodegc queue and the inodegc worker
> in which the cpumask bit can be left clear even though inodes remain on the
> per-cpu gc list.

What problems does this cause? i.e. how do we identify systems
hitting this issue?

> 
> Current problematic sequence:
> 
>   CPU0                             CPU1
>   --------------------             ---------------------
>   xfs_inodegc_queue()              xfs_inodegc_worker()
>                                      llist_del_all(&gc->list)
>     llist_add(&ip->i_gclist, &gc->list)
>     cpumask_test_and_set_cpu()
>                                      cpumask_clear_cpu()
>                   < cpumask not set >
> 
> Fix this by moving llist_del_all() to after cpumask_clear_cpu() in the
> worker to ensure proper ordering. With this ordering, a concurrent queue
> operation either adds its inode before the worker's llist_del_all() (so the
> entry is picked up in this pass) or sets the cpumask bit after the worker
> has cleared it (so the bit remains set for the new entry).
> 
> Also remove the now-unnecessary smp_mb__{before,after}_atomic() barriers,
> since the llist_* operations already provide the required ordering
> semantics. This makes the code cleaner; a condensed sketch of those
> primitives is included after the diffstat.

IIRC, the barriers were for ordering the cpumask bitmap ops against
llist operations. There are calls elsewhere to for_each_cpu() that
then use llist_empty() checks (e.g. xfs_inodegc_queue_all/wait_all),
so on relaxed architectures (like alpha) I think we have to ensure
the bitmask ops carry full ordering against the independent llist
ops themselves. i.e. llist_empty() just uses READ_ONCE, so it only
orders against other llist ops and won't guarantee any specific
ordering against cpumask modifications.
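
From memory, the pattern in question looks roughly like this (condensed;
field names recalled from fs/xfs/xfs_icache.c and may not match upstream
exactly):

	/* e.g. xfs_inodegc_queue_all(), condensed */
	for_each_cpu(cpu, &mp->m_inodegc_cpumask) {
		struct xfs_inodegc *gc = per_cpu_ptr(mp->m_inodegc, cpu);

		if (!llist_empty(&gc->list))
			mod_delayed_work_on(cpu, mp->m_inodegc_wq,
					&gc->work, 0);
	}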

I could be remembering incorrectly, but I think that was the
original reason for the barriers. Can you please confirm that the
cpumask iteration/llist_empty checks do not need these bitmask
barriers anymore? If that's ok, then the change looks fine.

-Dave.

Patch

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 7b6c026d01a1..fba784d7a146 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1948,19 +1948,20 @@  xfs_inodegc_worker(
 {
 	struct xfs_inodegc	*gc = container_of(to_delayed_work(work),
 						struct xfs_inodegc, work);
-	struct llist_node	*node = llist_del_all(&gc->list);
+	struct llist_node	*node;
 	struct xfs_inode	*ip, *n;
 	struct xfs_mount	*mp = gc->mp;
 	unsigned int		nofs_flag;
 
+	cpumask_clear_cpu(gc->cpu, &mp->m_inodegc_cpumask);
+
 	/*
-	 * Clear the cpu mask bit and ensure that we have seen the latest
-	 * update of the gc structure associated with this CPU. This matches
-	 * with the release semantics used when setting the cpumask bit in
-	 * xfs_inodegc_queue.
+	 * llist_del_all() provides ordering semantics that ensure the cpumask
+	 * clearing is always visible before all entries are removed from the
+	 * gc list. This prevents xfs_inodegc_queue() from adding entries to
+	 * the list while the cpumask bit is left unset.
 	 */
-	cpumask_clear_cpu(gc->cpu, &mp->m_inodegc_cpumask);
-	smp_mb__after_atomic();
+	node = llist_del_all(&gc->list);
 
 	WRITE_ONCE(gc->items, 0);
 
@@ -2188,13 +2189,11 @@  xfs_inodegc_queue(
 	shrinker_hits = READ_ONCE(gc->shrinker_hits);
 
 	/*
-	 * Ensure the list add is always seen by anyone who finds the cpumask
-	 * bit set. This effectively gives the cpumask bit set operation
-	 * release ordering semantics.
+	 * llist_add() provides the ordering semantics needed to ensure list
+	 * additions are visible when the cpumask bit is found set, so no
+	 * additional memory barrier is needed.
 	 */
-	smp_mb__before_atomic();
-	if (!cpumask_test_cpu(cpu_nr, &mp->m_inodegc_cpumask))
-		cpumask_test_and_set_cpu(cpu_nr, &mp->m_inodegc_cpumask);
+	cpumask_test_and_set_cpu(cpu_nr, &mp->m_inodegc_cpumask);
 
 	/*
 	 * We queue the work while holding the current CPU so that the work