
[2/2] md: fix deadlock between mddev_suspend and flush bio

Message ID 20240525185257.3896201-3-linan666@huaweicloud.com (mailing list archive)
State Accepted, archived
Series md: flush deadlock bugfix

Commit Message

Li Nan May 25, 2024, 6:52 p.m. UTC
From: Li Nan <linan122@huawei.com>

A deadlock occurs when the mddev is being suspended while a flush bio is
in progress. It is a complex issue involving four contexts:

T1. the first flush is at its final stage: it clears 'mddev->flush_bio'
    and tries to submit its data, but is blocked because the mddev is
    suspended by T4.
T2. the second flush sets 'mddev->flush_bio' and attempts to queue
    md_submit_flush_data(), which is already running (T1) and will not be
    executed again if queued on the same CPU as T1.
T3. the third flush increments active_io and tries to flush, but is
    blocked because 'mddev->flush_bio' is not NULL (set by T2).
T4. mddev_suspend() is called and waits for active_io to drop to 0, but
    active_io is held by T3.

  T1		T2		T3		T4
  (flush 1)	(flush 2)	(flush 3)	(suspend)
  md_submit_flush_data
   mddev->flush_bio = NULL;
   .
   .	 	md_flush_request
   .	  	 mddev->flush_bio = bio
   .	  	 queue submit_flushes
   .		 .
   .		 .		md_handle_request
   .		 .		 active_io + 1
   .		 .		 md_flush_request
   .		 .		  wait !mddev->flush_bio
   .		 .
   .		 .				mddev_suspend
   .		 .				 wait !active_io
   .		 .
   .		 submit_flushes
   .		 queue_work md_submit_flush_data
   .		 //md_submit_flush_data is already running (T1)
   .
   md_handle_request
    wait resume

The root cause is the non-atomic increment/decrement of active_io during
the flush process: active_io is decremented before md_submit_flush_data()
is queued, and incremented again soon after md_submit_flush_data() runs.
  md_flush_request
    active_io + 1
    submit_flushes
      active_io - 1
      md_submit_flush_data
        md_handle_request
          active_io + 1
          make_request
          active_io - 1

If active_io is instead decremented after md_handle_request(), rather
than within submit_flushes(), make_request() can be called directly
instead of md_handle_request() in md_submit_flush_data(), and active_io
is then incremented and decremented only once during the whole flush
process. This fixes the deadlock.
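
For comparison, the call tree after this patch (derived from the hunks
below) becomes:
  md_flush_request
    active_io + 1
    submit_flushes
      md_submit_flush_data
        make_request
        active_io - 1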

Additionally, the only behavioral difference introduced by the fix is
that the return value of make_request() is no longer handled. But after
the previous patch cleaned up md_write_start(), make_request() only
returns an error in raid5_make_request() when called by dm-raid, see
commit 41425f96d7aa ("dm-raid456, md/raid456: fix a deadlock for
dm-raid456 while io concurrent with reshape"). Since dm always splits
data and flush operations into two separate ios, the io size of a flush
submitted by dm is always 0, so make_request() will not be called in
md_submit_flush_data(). To prevent future modifications from introducing
issues, add a WARN_ON to ensure that make_request() does not return an
error in this context.
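
For reference, a condensed sketch of the resulting tail of
md_submit_flush_data(), assembled from the hunks below plus the
surrounding context in md.c (the flush_bio reset and its locking are
elided); the zero-size branch is the one every dm-submitted flush takes:

	if (bio->bi_iter.bi_size == 0) {
		/* an empty flush: dm-submitted flushes always land here */
		bio_endio(bio);
	} else {
		bio->bi_opf &= ~REQ_PREFLUSH;
		/* not expected to fail here, see the comment in the patch */
		if (WARN_ON_ONCE(!mddev->pers->make_request(mddev, bio)))
			bio_io_error(bio);
	}

	/* The pair is percpu_ref_get() from md_flush_request() */
	percpu_ref_put(&mddev->active_io);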

Fixes: fa2bbff7b0b4 ("md: synchronize flush io with array reconfiguration")
Signed-off-by: Li Nan <linan122@huawei.com>
---
 drivers/md/md.c | 26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

Comments

Yu Kuai May 28, 2024, 1:12 p.m. UTC | #1
Hi,

On 2024/05/26 2:52, linan666@huaweicloud.com wrote:
> From: Li Nan <linan122@huawei.com>
> 
> A deadlock occurs when the mddev is being suspended while a flush bio is
> in progress. It is a complex issue involving four contexts:
> 
> T1. the first flush is at its final stage: it clears 'mddev->flush_bio'
>      and tries to submit its data, but is blocked because the mddev is
>      suspended by T4.
> T2. the second flush sets 'mddev->flush_bio' and attempts to queue
>      md_submit_flush_data(), which is already running (T1) and will not be
>      executed again if queued on the same CPU as T1.
> T3. the third flush increments active_io and tries to flush, but is
>      blocked because 'mddev->flush_bio' is not NULL (set by T2).
> T4. mddev_suspend() is called and waits for active_io to drop to 0, but
>      active_io is held by T3.
> 
>    T1		T2		T3		T4
>    (flush 1)	(flush 2)	(flush 3)	(suspend)
>    md_submit_flush_data
>     mddev->flush_bio = NULL;
>     .
>     .	 	md_flush_request
>     .	  	 mddev->flush_bio = bio
>     .	  	 queue submit_flushes
>     .		 .
>     .		 .		md_handle_request
>     .		 .		 active_io + 1
>     .		 .		 md_flush_request
>     .		 .		  wait !mddev->flush_bio
>     .		 .
>     .		 .				mddev_suspend
>     .		 .				 wait !active_io
>     .		 .
>     .		 submit_flushes
>     .		 queue_work md_submit_flush_data
>     .		 //md_submit_flush_data is already running (T1)
>     .
>     md_handle_request
>      wait resume
> 
> The root cause is the non-atomic increment/decrement of active_io during
> the flush process: active_io is decremented before md_submit_flush_data()
> is queued, and incremented again soon after md_submit_flush_data() runs.
>    md_flush_request
>      active_io + 1
>      submit_flushes
>        active_io - 1
>        md_submit_flush_data
>          md_handle_request
>            active_io + 1
>            make_request
>            active_io - 1
> 
> If active_io is instead decremented after md_handle_request(), rather
> than within submit_flushes(), make_request() can be called directly
> instead of md_handle_request() in md_submit_flush_data(), and active_io
> is then incremented and decremented only once during the whole flush
> process. This fixes the deadlock.
> 
> Additionally, the only behavioral difference introduced by the fix is
> that the return value of make_request() is no longer handled. But after
> the previous patch cleaned up md_write_start(), make_request() only
> returns an error in raid5_make_request() when called by dm-raid, see
> commit 41425f96d7aa ("dm-raid456, md/raid456: fix a deadlock for
> dm-raid456 while io concurrent with reshape"). Since dm always splits
> data and flush operations into two separate ios, the io size of a flush
> submitted by dm is always 0, so make_request() will not be called in
> md_submit_flush_data(). To prevent future modifications from introducing
> issues, add a WARN_ON to ensure that make_request() does not return an
> error in this context.
> 
> Fixes: fa2bbff7b0b4 ("md: synchronize flush io with array reconfiguration")
> Signed-off-by: Li Nan <linan122@huawei.com>

The patch itself looks correct. However, there was a plan to remove
the flush handling and submit the flush bio directly to the underlying
disks, like dm does, because md_flush_request(), which is in the fast
path, grabs a disk-level spinlock (mddev->lock) and will hurt
performance.

I'm fine taking this patch first, I'll leave the decision to Song.

Thanks,
Kuai


> ---
>   drivers/md/md.c | 26 +++++++++++++++-----------
>   1 file changed, 15 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 14d6e615bcbb..9bb7e627e57f 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -550,13 +550,9 @@ static void md_end_flush(struct bio *bio)
>   
>   	rdev_dec_pending(rdev, mddev);
>   
> -	if (atomic_dec_and_test(&mddev->flush_pending)) {
> -		/* The pair is percpu_ref_get() from md_flush_request() */
> -		percpu_ref_put(&mddev->active_io);
> -
> +	if (atomic_dec_and_test(&mddev->flush_pending))
>   		/* The pre-request flush has finished */
>   		queue_work(md_wq, &mddev->flush_work);
> -	}
>   }
>   
>   static void md_submit_flush_data(struct work_struct *ws);
> @@ -587,12 +583,8 @@ static void submit_flushes(struct work_struct *ws)
>   			rcu_read_lock();
>   		}
>   	rcu_read_unlock();
> -	if (atomic_dec_and_test(&mddev->flush_pending)) {
> -		/* The pair is percpu_ref_get() from md_flush_request() */
> -		percpu_ref_put(&mddev->active_io);
> -
> +	if (atomic_dec_and_test(&mddev->flush_pending))
>   		queue_work(md_wq, &mddev->flush_work);
> -	}
>   }
>   
>   static void md_submit_flush_data(struct work_struct *ws)
> @@ -617,8 +609,20 @@ static void md_submit_flush_data(struct work_struct *ws)
>   		bio_endio(bio);
>   	} else {
>   		bio->bi_opf &= ~REQ_PREFLUSH;
> -		md_handle_request(mddev, bio);
> +
> +		/*
> +		 * make_request() never returns an error here; it only
> +		 * returns an error in raid5_make_request() via dm-raid.
> +		 * Since dm always splits data and flush operations into
> +		 * two separate ios, the io size of a flush submitted by
> +		 * dm is always 0, so make_request() is not called here.
> +		 */
> +		if (WARN_ON_ONCE(!mddev->pers->make_request(mddev, bio)))
> +			bio_io_error(bio);
>   	}
> +
> +	/* The pair is percpu_ref_get() from md_flush_request() */
> +	percpu_ref_put(&mddev->active_io);
>   }
>   
>   /*
>

Patch

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 14d6e615bcbb..9bb7e627e57f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -550,13 +550,9 @@  static void md_end_flush(struct bio *bio)
 
 	rdev_dec_pending(rdev, mddev);
 
-	if (atomic_dec_and_test(&mddev->flush_pending)) {
-		/* The pair is percpu_ref_get() from md_flush_request() */
-		percpu_ref_put(&mddev->active_io);
-
+	if (atomic_dec_and_test(&mddev->flush_pending))
 		/* The pre-request flush has finished */
 		queue_work(md_wq, &mddev->flush_work);
-	}
 }
 
 static void md_submit_flush_data(struct work_struct *ws);
@@ -587,12 +583,8 @@  static void submit_flushes(struct work_struct *ws)
 			rcu_read_lock();
 		}
 	rcu_read_unlock();
-	if (atomic_dec_and_test(&mddev->flush_pending)) {
-		/* The pair is percpu_ref_get() from md_flush_request() */
-		percpu_ref_put(&mddev->active_io);
-
+	if (atomic_dec_and_test(&mddev->flush_pending))
 		queue_work(md_wq, &mddev->flush_work);
-	}
 }
 
 static void md_submit_flush_data(struct work_struct *ws)
@@ -617,8 +609,20 @@  static void md_submit_flush_data(struct work_struct *ws)
 		bio_endio(bio);
 	} else {
 		bio->bi_opf &= ~REQ_PREFLUSH;
-		md_handle_request(mddev, bio);
+
+		/*
+		 * make_request() never returns an error here; it only
+		 * returns an error in raid5_make_request() via dm-raid.
+		 * Since dm always splits data and flush operations into
+		 * two separate ios, the io size of a flush submitted by
+		 * dm is always 0, so make_request() is not called here.
+		 */
+		if (WARN_ON_ONCE(!mddev->pers->make_request(mddev, bio)))
+			bio_io_error(bio);
 	}
+
+	/* The pair is percpu_ref_get() from md_flush_request() */
+	percpu_ref_put(&mddev->active_io);
 }
 
 /*