blk-mq: release scheduler resource when request completes

Message ID 20240322174014.373323-1-bvanassche@acm.org (mailing list archive)
State New, archived

Commit Message

Bart Van Assche March 22, 2024, 5:40 p.m. UTC
From: Chengming Zhou <zhouchengming@bytedance.com>

commit e5c0ca13659e9d18f53368d651ed7e6e433ec1cf upstream.

Chuck reported [1] an IO hang problem on NFS exports that reside on SATA
devices and bisected it to commit 615939a2ae73 ("blk-mq: defer to the
normal submission path for post-flush requests").

We analysed the IO hang and found two postflush requests waiting for
each other.

The first postflush request completed the REQ_FSEQ_DATA sequence, moved
on to the REQ_FSEQ_POSTFLUSH sequence and was added to the flush pending
list, but blk_kick_flush() failed because of the second postflush
request, which was still in flight, waiting in the scheduler queue.

The second postflush request, waiting in the scheduler queue, can't be
dispatched because the first postflush request hasn't released its
scheduler resource even though it has already completed.

Fix it by releasing the scheduler resource when the first postflush
request completes, so that the second postflush request can be dispatched
and completed, which in turn lets blk_kick_flush() succeed.
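
As an illustration of that idea (not part of the patch itself), here is a
minimal user-space model: the scheduler hook runs on the first completion
and an RQF_ELVPRIV-style flag is cleared so that the later free of the
same postflush request becomes a no-op. The struct layout,
sched_finish_request() and the flag value are simplified stand-ins, not
kernel code.

#include <stdio.h>

#define RQF_ELVPRIV (1u << 0)	/* scheduler holds private data for this rq */

struct request {
	unsigned int rq_flags;
};

static int finish_calls;

/* stand-in for q->elevator->type->ops.finish_request(rq) */
static void sched_finish_request(struct request *rq)
{
	(void)rq;
	finish_calls++;
}

static void finish_request(struct request *rq)
{
	if (rq->rq_flags & RQF_ELVPRIV) {
		sched_finish_request(rq);
		/* clear the flag so a later completion/free is a no-op */
		rq->rq_flags &= ~RQF_ELVPRIV;
	}
}

int main(void)
{
	struct request rq = { .rq_flags = RQF_ELVPRIV };

	finish_request(&rq);	/* completion path: releases the resource */
	finish_request(&rq);	/* free path: already released, no-op */

	printf("scheduler finish_request() called %d time(s)\n", finish_calls);
	return 0;
}

Built with any C compiler, it reports that the scheduler hook ran exactly
once even though finish_request() was reached twice.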

While at it, remove the check for e->ops.finish_request, as all
schedulers set that. Reaffirm this requirement by adding a WARN_ON_ONCE()
at scheduler registration time, just like we do for insert_requests and
dispatch_request.
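
Purely for illustration (this 6.1 backport drops that warning and keeps
the pointer check instead, as noted below), a small user-space sketch of
such a registration-time requirement; register_sched() and the structs
are hypothetical stand-ins, not the kernel's elv_register():

#include <stdio.h>

/* Simplified stand-ins for the kernel's elevator ops/type structures. */
struct elevator_ops {
	void (*insert_requests)(void);
	void (*dispatch_request)(void);
	void (*finish_request)(void);
};

struct elevator_type {
	const char *name;
	struct elevator_ops ops;
};

/*
 * Model of a registration-time sanity check: reject a scheduler that
 * does not implement the mandatory hooks, instead of testing the
 * finish_request pointer on every request completion.
 */
static int register_sched(const struct elevator_type *e)
{
	if (!e->ops.insert_requests || !e->ops.dispatch_request ||
	    !e->ops.finish_request) {
		fprintf(stderr, "%s: missing mandatory elevator op\n", e->name);
		return -1;
	}
	return 0;
}

static void insert(void) { }
static void dispatch(void) { }
static void finish(void) { }

int main(void)
{
	struct elevator_type good = {
		.name = "model-sched",
		.ops = { insert, dispatch, finish },
	};
	struct elevator_type bad = {
		.name = "no-finish",
		.ops = { insert, dispatch, NULL },
	};

	printf("good: %d\n", register_sched(&good));	/* 0 */
	printf("bad:  %d\n", register_sched(&bad));	/* -1, would WARN in the kernel */
	return 0;
}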

[1] https://lore.kernel.org/all/7A57C7AE-A51A-4254-888B-FE15CA21F9E9@oracle.com/

Link: https://lore.kernel.org/linux-block/20230819031206.2744005-1-chengming.zhou@linux.dev/
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202308172100.8ce4b853-oliver.sang@intel.com
Fixes: 615939a2ae73 ("blk-mq: defer to the normal submission path for post-flush requests")
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/20230813152325.3017343-1-chengming.zhou@linux.dev
[axboe: folded in incremental fix and added tags]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[bvanassche: changed RQF_USE_SCHED into RQF_ELVPRIV; restored the
finish_request pointer check before calling finish_request and removed
the new warning from the elevator code. This patch fixes an I/O hang
when submitting a REQ_FUA request to a request queue for a zoned block
device for which FUA has been disabled (QUEUE_FLAG_FUA is not set).]
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-mq.c | 24 +++++++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)

Comments

Bart Van Assche March 22, 2024, 5:43 p.m. UTC | #1
On 3/22/24 10:40, Bart Van Assche wrote:
> commit e5c0ca13659e9d18f53368d651ed7e6e433ec1cf upstream.

This backport is intended for the 6.1 stable kernel series.

Thanks,

Bart.
Greg KH March 29, 2024, 1:16 p.m. UTC | #2
On Fri, Mar 22, 2024 at 10:40:14AM -0700, Bart Van Assche wrote:
> From: Chengming Zhou <zhouchengming@bytedance.com>
> [...]

Now queued up, thanks.

greg k-h

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7ed6b9469f97..07610505c177 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -675,6 +675,22 @@  struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 }
 EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
 
+static void blk_mq_finish_request(struct request *rq)
+{
+	struct request_queue *q = rq->q;
+
+	if ((rq->rq_flags & RQF_ELVPRIV) &&
+	    q->elevator->type->ops.finish_request) {
+		q->elevator->type->ops.finish_request(rq);
+		/*
+		 * For postflush request that may need to be
+		 * completed twice, we should clear this flag
+		 * to avoid double finish_request() on the rq.
+		 */
+		rq->rq_flags &= ~RQF_ELVPRIV;
+	}
+}
+
 static void __blk_mq_free_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
@@ -701,9 +717,7 @@  void blk_mq_free_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
 
-	if ((rq->rq_flags & RQF_ELVPRIV) &&
-	    q->elevator->type->ops.finish_request)
-		q->elevator->type->ops.finish_request(rq);
+	blk_mq_finish_request(rq);
 
 	if (unlikely(laptop_mode && !blk_rq_is_passthrough(rq)))
 		laptop_io_completion(q->disk->bdi);
@@ -1025,6 +1039,8 @@  inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
 	if (blk_mq_need_time_stamp(rq))
 		__blk_mq_end_request_acct(rq, ktime_get_ns());
 
+	blk_mq_finish_request(rq);
+
 	if (rq->end_io) {
 		rq_qos_done(rq->q, rq);
 		if (rq->end_io(rq, error) == RQ_END_IO_FREE)
@@ -1079,6 +1095,8 @@  void blk_mq_end_request_batch(struct io_comp_batch *iob)
 		if (iob->need_ts)
 			__blk_mq_end_request_acct(rq, now);
 
+		blk_mq_finish_request(rq);
+
 		rq_qos_done(rq->q, rq);
 
 		/*