Message ID | 20211117033807.185715-2-axboe@kernel.dk (mailing list archive)
---|---
State | New, archived
Series | Add support for list issue
> @@ -2208,6 +2208,19 @@ static void blk_mq_plug_issue_direct(struct blk_plug *plug, bool from_schedule)
>  	int queued = 0;
>  	int errors = 0;
>  
> +	/*
> +	 * Peek first request and see if we have a ->queue_rqs() hook. If we
> +	 * do, we can dispatch the whole plug list in one go. We already know
> +	 * at this point that all requests belong to the same queue, caller
> +	 * must ensure that's the case.
> +	 */
> +	rq = rq_list_peek(&plug->mq_list);
> +	if (rq->q->mq_ops->queue_rqs) {
> +		rq->q->mq_ops->queue_rqs(&plug->mq_list);
> +		if (rq_list_empty(plug->mq_list))
> +			return;
> +	}

I'd move this straight into the caller to simplify the flow, something
like the patch below. Also I wonder if the slow path might want to
be moved from blk_mq_flush_plug_list into a separate helper to keep
the fast path as straight as possible?

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3ab34c4f20daf..370f4b2ab4511 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2255,6 +2255,14 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 	plug->rq_count = 0;
 
 	if (!plug->multiple_queues && !plug->has_elevator && !from_schedule) {
+		struct request_queue *q = rq_list_peek(&plug->mq_list)->q;
+
+		if (q->mq_ops->queue_rqs) {
+			q->mq_ops->queue_rqs(&plug->mq_list);
+			if (rq_list_empty(plug->mq_list))
+				return;
+		}
+
 		blk_mq_plug_issue_direct(plug, false);
 		if (rq_list_empty(plug->mq_list))
 			return;
On Tue, Nov 16, 2021 at 08:38:04PM -0700, Jens Axboe wrote:
> If we have a list of requests in our plug list, send it to the driver in
> one go, if possible. The driver must set mq_ops->queue_rqs() to support
> this, if not the usual one-by-one path is used.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  block/blk-mq.c         | 17 +++++++++++++++++
>  include/linux/blk-mq.h |  8 ++++++++
>  2 files changed, 25 insertions(+)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 9b4e79e2ac1e..005715206b16 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2208,6 +2208,19 @@ static void blk_mq_plug_issue_direct(struct blk_plug *plug, bool from_schedule)
>  	int queued = 0;
>  	int errors = 0;
>  
> +	/*
> +	 * Peek first request and see if we have a ->queue_rqs() hook. If we
> +	 * do, we can dispatch the whole plug list in one go. We already know
> +	 * at this point that all requests belong to the same queue, caller
> +	 * must ensure that's the case.
> +	 */
> +	rq = rq_list_peek(&plug->mq_list);
> +	if (rq->q->mq_ops->queue_rqs) {
> +		rq->q->mq_ops->queue_rqs(&plug->mq_list);
> +		if (rq_list_empty(plug->mq_list))
> +			return;
> +	}
> +

Then BLK_MQ_F_TAG_QUEUE_SHARED isn't handled as before for multiple NVMe
NS.

thanks,
Ming
On 11/16/21 11:25 PM, Christoph Hellwig wrote:
>> @@ -2208,6 +2208,19 @@ static void blk_mq_plug_issue_direct(struct blk_plug *plug, bool from_schedule)
>>  	int queued = 0;
>>  	int errors = 0;
>>  
>> +	/*
>> +	 * Peek first request and see if we have a ->queue_rqs() hook. If we
>> +	 * do, we can dispatch the whole plug list in one go. We already know
>> +	 * at this point that all requests belong to the same queue, caller
>> +	 * must ensure that's the case.
>> +	 */
>> +	rq = rq_list_peek(&plug->mq_list);
>> +	if (rq->q->mq_ops->queue_rqs) {
>> +		rq->q->mq_ops->queue_rqs(&plug->mq_list);
>> +		if (rq_list_empty(plug->mq_list))
>> +			return;
>> +	}
>
> I'd move this straight into the caller to simplify the flow, something
> like the patch below. Also I wonder if the slow path might want to
> be moved from blk_mq_flush_plug_list into a separate helper to keep
> the fast path as straight as possible?

Yes, I think that's better. I'll fold that in.
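[Editorial sketch: the helper split agreed on above could be shaped like the
following. The helper name blk_mq_plug_issue_slowpath() and its body are
assumptions for illustration, simplified from the quoted code; they are not
part of this series, which also keeps the direct-issue loop on the fallback
path.]

/*
 * Sketch only (hypothetical names): keep the ->queue_rqs() fast path
 * inline in blk_mq_flush_plug_list() and push everything else into a
 * separate helper, so the common case stays branch-free and short.
 */
static void blk_mq_plug_issue_slowpath(struct blk_plug *plug,
				       bool from_schedule)
{
	struct request *rq;

	/* fall back to inserting each remaining request individually */
	while ((rq = rq_list_pop(&plug->mq_list)))
		blk_mq_request_bypass_insert(rq, false,
					     rq_list_empty(plug->mq_list));
}

void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
{
	struct request_queue *q;

	if (rq_list_empty(plug->mq_list))
		return;
	plug->rq_count = 0;

	if (!plug->multiple_queues && !plug->has_elevator && !from_schedule) {
		q = rq_list_peek(&plug->mq_list)->q;
		if (q->mq_ops->queue_rqs) {
			q->mq_ops->queue_rqs(&plug->mq_list);
			if (rq_list_empty(plug->mq_list))
				return;
		}
	}
	blk_mq_plug_issue_slowpath(plug, from_schedule);
}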
On 11/17/21 1:20 AM, Ming Lei wrote:
> On Tue, Nov 16, 2021 at 08:38:04PM -0700, Jens Axboe wrote:
>> If we have a list of requests in our plug list, send it to the driver in
>> one go, if possible. The driver must set mq_ops->queue_rqs() to support
>> this, if not the usual one-by-one path is used.
>>
[...]
>> +	rq = rq_list_peek(&plug->mq_list);
>> +	if (rq->q->mq_ops->queue_rqs) {
>> +		rq->q->mq_ops->queue_rqs(&plug->mq_list);
>> +		if (rq_list_empty(plug->mq_list))
>> +			return;
>> +	}
>> +
>
> Then BLK_MQ_F_TAG_QUEUE_SHARED isn't handled as before for multiple NVMe
> NS.

Can you expand? If we have multiple namespaces in the plug list, we have
multiple queues. There's no direct issue of the list if that's the case.
Or maybe I'm missing what you mean here?
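[Editorial sketch: the guarantee Jens relies on here is that the plug tracks
whether requests from more than one queue were ever added. A simplified
paraphrase of that tracking, modeled on blk_add_rq_to_plug() in
block/blk-mq.c (not code from this patch), is below.]

/*
 * Paraphrased sketch of how the plug learns about multiple queues; the
 * real function also flushes oversized plugs and sets has_elevator.
 */
static void blk_add_rq_to_plug_sketch(struct blk_plug *plug,
				      struct request *rq)
{
	struct request *last = rq_list_peek(&plug->mq_list);

	/* requests for two different queues disable the list fast path */
	if (!plug->multiple_queues && last && last->q != rq->q)
		plug->multiple_queues = true;

	rq_list_add(&plug->mq_list, rq);
	plug->rq_count++;
}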
On Tue, Nov 16, 2021 at 08:38:04PM -0700, Jens Axboe wrote:
> If we have a list of requests in our plug list, send it to the driver in
> one go, if possible. The driver must set mq_ops->queue_rqs() to support
> this, if not the usual one-by-one path is used.

It looks like we still need to sync with the request_queue quiesce flag.
Something like the following (untested) on top of this patch should do
it:

---
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 688ebf6a7a7b..447d0b77375d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -263,6 +263,9 @@ void blk_mq_wait_quiesce_done(struct request_queue *q)
 	unsigned int i;
 	bool rcu = false;
 
+	if (q->tag_set->flags & BLK_MQ_F_BLOCKING)
+		synchronize_srcu(q->srcu);
+
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (hctx->flags & BLK_MQ_F_BLOCKING)
 			synchronize_srcu(hctx->srcu);
@@ -2201,6 +2204,25 @@ static void blk_mq_commit_rqs(struct blk_mq_hw_ctx *hctx, int *queued,
 	*queued = 0;
 }
 
+static void queue_lock(struct request_queue *q, int *srcu_idx)
+	__acquires(q->srcu)
+{
+	if (!(q->tag_set->flags & BLK_MQ_F_BLOCKING)) {
+		*srcu_idx = 0;
+		rcu_read_lock();
+	} else
+		*srcu_idx = srcu_read_lock(q->srcu);
+}
+
+static void queue_unlock(struct request_queue *q, int srcu_idx)
+	__releases(q->srcu)
+{
+	if (!(q->tag_set->flags & BLK_MQ_F_BLOCKING))
+		rcu_read_unlock();
+	else
+		srcu_read_unlock(q->srcu, srcu_idx);
+}
+
 static void blk_mq_plug_issue_direct(struct blk_plug *plug, bool from_schedule)
 {
 	struct blk_mq_hw_ctx *hctx = NULL;
@@ -2216,7 +2238,14 @@ static void blk_mq_plug_issue_direct(struct blk_plug *plug, bool from_schedule)
 	 */
 	rq = rq_list_peek(&plug->mq_list);
 	if (rq->q->mq_ops->queue_rqs) {
-		rq->q->mq_ops->queue_rqs(&plug->mq_list);
+		struct request_queue *q = rq->q;
+		int srcu_idx;
+
+		queue_lock(q, &srcu_idx);
+		if (!blk_queue_quiesced(q))
+			q->mq_ops->queue_rqs(&plug->mq_list);
+		queue_unlock(q, srcu_idx);
+
 		if (rq_list_empty(plug->mq_list))
 			return;
 	}
@@ -3727,6 +3756,8 @@ int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ);
 
 	q->tag_set = set;
+	if (set->flags & BLK_MQ_F_BLOCKING)
+		init_srcu_struct(q->srcu);
 
 	q->queue_flags |= QUEUE_FLAG_MQ_DEFAULT;
 	if (set->nr_maps > HCTX_TYPE_POLL &&
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index bd4370baccca..ae7591dc9cbb 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -373,6 +373,8 @@ struct request_queue {
 	 * devices that do not have multiple independent access ranges.
 	 */
 	struct blk_independent_access_ranges *ia_ranges;
+
+	struct srcu_struct srcu[];
 };
 
 /* Keep blk_queue_flag_name[] in sync with the definitions below */
--
On Wed, Nov 17, 2021 at 08:43:02AM -0700, Jens Axboe wrote:
> On 11/17/21 1:20 AM, Ming Lei wrote:
> > On Tue, Nov 16, 2021 at 08:38:04PM -0700, Jens Axboe wrote:
> >> If we have a list of requests in our plug list, send it to the driver in
> >> one go, if possible. The driver must set mq_ops->queue_rqs() to support
> >> this, if not the usual one-by-one path is used.
[...]
> > Then BLK_MQ_F_TAG_QUEUE_SHARED isn't handled as before for multiple NVMe
> > NS.
>
> Can you expand? If we have multiple namespaces in the plug list, we have
> multiple queues. There's no direct issue of the list if that's the case.
> Or maybe I'm missing what you mean here?

If the plug list only has requests for one namespace, I think Ming is
referring to the special accounting for BLK_MQ_F_TAG_QUEUE_SHARED in
__blk_mq_get_driver_tag() that normally gets called before dispatching
to the driver, but isn't getting called when using .queue_rqs().
On Wed, Nov 17, 2021 at 12:48:14PM -0800, Keith Busch wrote:
> On Wed, Nov 17, 2021 at 08:43:02AM -0700, Jens Axboe wrote:
> > On 11/17/21 1:20 AM, Ming Lei wrote:
[...]
> > > Then BLK_MQ_F_TAG_QUEUE_SHARED isn't handled as before for multiple NVMe
> > > NS.
> >
> > Can you expand? If we have multiple namespaces in the plug list, we have
> > multiple queues. There's no direct issue of the list if that's the case.
> > Or maybe I'm missing what you mean here?
>
> If the plug list only has requests for one namespace, I think Ming is
> referring to the special accounting for BLK_MQ_F_TAG_QUEUE_SHARED in
> __blk_mq_get_driver_tag() that normally gets called before dispatching
> to the driver, but isn't getting called when using .queue_rqs().

Yeah, that is it. This is a normal case: each task runs I/O on a
different namespace, but all NSs share/contend for the single set of
host tags, and BLK_MQ_F_TAG_QUEUE_SHARED is supposed to provide fair
allocation among all these NSs.

thanks,
Ming
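[Editorial sketch: the fairness accounting being bypassed limits each active
queue to a share of the shared tag space. The following is a simplified
paraphrase of the hctx_may_queue() logic reached via
__blk_mq_get_driver_tag(); it is not an exact copy of the kernel source.]

/*
 * Simplified paraphrase of hctx_may_queue(): with
 * BLK_MQ_F_TAG_QUEUE_SHARED, each queue actively submitting I/O
 * (e.g. each NVMe namespace) is limited to a fair fraction of the
 * shared tag depth. Skipping driver-tag allocation in .queue_rqs()
 * would skip this check.
 */
static bool hctx_may_queue_sketch(struct blk_mq_hw_ctx *hctx,
				  unsigned int total_depth)
{
	unsigned int users, fair_depth;

	if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
		return true;

	users = atomic_read(&hctx->tags->active_queues);
	if (!users)
		return true;

	/* give each active queue an equal share, with a small floor */
	fair_depth = max(total_depth / users, 4U);
	return atomic_read(&hctx->nr_active) < fair_depth;
}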
On Wed, Nov 17, 2021 at 12:41:01PM -0800, Keith Busch wrote:
> On Tue, Nov 16, 2021 at 08:38:04PM -0700, Jens Axboe wrote:
> > If we have a list of requests in our plug list, send it to the driver in
> > one go, if possible. The driver must set mq_ops->queue_rqs() to support
> > this, if not the usual one-by-one path is used.
>
> It looks like we still need to sync with the request_queue quiesce flag.

I think this approach is good.

> Something like the following (untested) on top of this patch should do
> it:
>
[...]
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index bd4370baccca..ae7591dc9cbb 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -373,6 +373,8 @@ struct request_queue {
>  	 * devices that do not have multiple independent access ranges.
>  	 */
>  	struct blk_independent_access_ranges *ia_ranges;
> +
> +	struct srcu_struct srcu[];

Basically it is the same as my previous post[1], but the above patch
doesn't handle request queue allocation/freeing correctly in case of
BLK_MQ_F_BLOCKING.

[1] https://lore.kernel.org/linux-block/20211103160018.3764976-1-ming.lei@redhat.com/

Thanks,
Ming
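[Editorial sketch: the allocation concern stems from srcu[] being a flexible
array member of struct request_queue, so the queue allocation must reserve
space for it and tear it down on release. A hypothetical fix might look like
the following; the *_sketch function names and sizing logic are assumptions,
not code from this thread.]

/*
 * Hypothetical sketch of the allocation/teardown the quoted patch is
 * missing: reserve space for the trailing srcu_struct when the tag set
 * is BLK_MQ_F_BLOCKING, and clean it up when the queue is released.
 */
struct request_queue *blk_alloc_queue_sketch(int node, bool blocking)
{
	struct request_queue *q;
	size_t size = sizeof(*q);

	if (blocking)
		size += sizeof(struct srcu_struct);

	q = kzalloc_node(size, GFP_KERNEL, node);
	if (!q)
		return NULL;
	if (blocking && init_srcu_struct(q->srcu)) {
		kfree(q);
		return NULL;
	}
	return q;
}

void blk_release_queue_sketch(struct request_queue *q, bool blocking)
{
	if (blocking)
		cleanup_srcu_struct(q->srcu);
	kfree(q);
}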
On Thu, Nov 18, 2021 at 08:18:29AM +0800, Ming Lei wrote:
> On Wed, Nov 17, 2021 at 12:41:01PM -0800, Keith Busch wrote:
> > On Tue, Nov 16, 2021 at 08:38:04PM -0700, Jens Axboe wrote:
> > > If we have a list of requests in our plug list, send it to the driver in
> > > one go, if possible. The driver must set mq_ops->queue_rqs() to support
> > > this, if not the usual one-by-one path is used.
> >
> > It looks like we still need to sync with the request_queue quiesce flag.
>
> Basically it is the same as my previous post[1], but the above patch doesn't
> handle request queue allocation/freeing correctly in case of BLK_MQ_F_BLOCKING.

Thanks for the pointer. I also thought it may be just as well that blocking
dispatchers don't get to use the optimized rqlist queueing, which would
simplify quiesce to the normal rcu usage.
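[Editorial sketch: Keith's simplification would restrict the list fast path
to non-blocking queues, so a plain RCU read-side section suffices to sync
with quiesce. A sketch of what that check might look like (an assumption,
not code from this series):]

/*
 * Sketch of the suggested restriction: only non-BLK_MQ_F_BLOCKING
 * queues take the ->queue_rqs() fast path, so plain RCU is enough to
 * synchronize against blk_mq_quiesce_queue().
 */
static void blk_mq_plug_issue_rqs_sketch(struct blk_plug *plug)
{
	struct request_queue *q = rq_list_peek(&plug->mq_list)->q;

	if (!q->mq_ops->queue_rqs || (q->tag_set->flags & BLK_MQ_F_BLOCKING))
		return;

	rcu_read_lock();
	if (!blk_queue_quiesced(q))
		q->mq_ops->queue_rqs(&plug->mq_list);
	rcu_read_unlock();
}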
On Wed, Nov 17, 2021 at 07:02:27PM -0700, Keith Busch wrote:
> On Thu, Nov 18, 2021 at 08:18:29AM +0800, Ming Lei wrote:
> > Basically it is the same as my previous post[1], but the above patch doesn't
> > handle request queue allocation/freeing correctly in case of BLK_MQ_F_BLOCKING.
>
> Thanks for the pointer. I also thought it may be just as well that blocking
> dispatchers don't get to use the optimized rqlist queueing,

Yeah, that is fine, but then nvme-tcp may not benefit from the
->queue_rqs() optimization.

> which would simplify quiesce to the normal rcu usage.

The normal dispatch code still needs to check the quiesced flag for
blocking queues, so SRCU is still needed; I don't see an easier way than
SRCU yet. Years ago I tried to replace the SRCU with a percpu refcount,
which is still more complicated than SRCU, though it does let srcu_idx
be dropped.

Thanks,
Ming
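[Editorial sketch: the percpu refcount alternative Ming mentions would
follow the usual kill/wait pattern. The q_dispatch_ref and mq_quiesce_wq
fields below are invented for illustration; this is not Ming's actual
patch.]

/*
 * Hypothetical sketch of quiesce via percpu_ref instead of SRCU.
 * Dispatchers take a reference instead of an srcu_idx; quiesce kills
 * the ref and waits for in-flight dispatchers to drain (the ref's
 * release callback would wake mq_quiesce_wq).
 */
static bool queue_dispatch_enter(struct request_queue *q)
{
	/* fails once quiesce has killed the ref */
	return percpu_ref_tryget_live(&q->q_dispatch_ref);
}

static void queue_dispatch_exit(struct request_queue *q)
{
	percpu_ref_put(&q->q_dispatch_ref);
}

void blk_mq_quiesce_queue_sketch(struct request_queue *q)
{
	percpu_ref_kill(&q->q_dispatch_ref);
	wait_event(q->mq_quiesce_wq, percpu_ref_is_zero(&q->q_dispatch_ref));
}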
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9b4e79e2ac1e..005715206b16 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2208,6 +2208,19 @@ static void blk_mq_plug_issue_direct(struct blk_plug *plug, bool from_schedule)
 	int queued = 0;
 	int errors = 0;
 
+	/*
+	 * Peek first request and see if we have a ->queue_rqs() hook. If we
+	 * do, we can dispatch the whole plug list in one go. We already know
+	 * at this point that all requests belong to the same queue, caller
+	 * must ensure that's the case.
+	 */
+	rq = rq_list_peek(&plug->mq_list);
+	if (rq->q->mq_ops->queue_rqs) {
+		rq->q->mq_ops->queue_rqs(&plug->mq_list);
+		if (rq_list_empty(plug->mq_list))
+			return;
+	}
+
 	while ((rq = rq_list_pop(&plug->mq_list))) {
 		bool last = rq_list_empty(plug->mq_list);
 		blk_status_t ret;
@@ -2256,6 +2269,10 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 
 	if (!plug->multiple_queues && !plug->has_elevator && !from_schedule) {
 		blk_mq_plug_issue_direct(plug, false);
+		/*
+		 * Expected case, all requests got dispatched. If not, fall
+		 * through to individual dispatch of the remainder.
+		 */
 		if (rq_list_empty(plug->mq_list))
 			return;
 	}
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 3ba1e750067b..897cf475e7eb 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -503,6 +503,14 @@ struct blk_mq_ops {
 	 */
 	void (*commit_rqs)(struct blk_mq_hw_ctx *);
 
+	/**
+	 * @queue_rqs: Queue a list of new requests. Driver is guaranteed
+	 * that each request belongs to the same queue. If the driver doesn't
+	 * empty the @rqlist completely, then the rest will be queued
+	 * individually by the block layer upon return.
+	 */
+	void (*queue_rqs)(struct request **rqlist);
+
 	/**
 	 * @get_budget: Reserve budget before queue request, once .queue_rq is
 	 * run, it is driver's responsibility to release the
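[Editorial sketch: to illustrate the contract documented in the new
kerneldoc, a hypothetical driver's ->queue_rqs() might be shaped like the
following. The mydrv_* names are invented; the requeue-list handling shows
how a partially consumed list falls back to one-by-one dispatch.]

/*
 * Hypothetical driver-side ->queue_rqs() sketch. Requests the driver
 * cannot take right now are put back on the list, so the block layer
 * queues them individually on return, per the kerneldoc above.
 */
static void mydrv_queue_rqs(struct request **rqlist)
{
	struct request *rq, *requeue_list = NULL;

	while ((rq = rq_list_pop(rqlist))) {
		bool last = rq_list_empty(*rqlist);

		/* mydrv_submit_cmd() is an assumed driver helper */
		if (!mydrv_submit_cmd(rq, last))
			rq_list_add(&requeue_list, rq);
	}

	/* whatever remains here is re-queued by the block layer */
	*rqlist = requeue_list;
}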
If we have a list of requests in our plug list, send it to the driver in
one go, if possible. The driver must set mq_ops->queue_rqs() to support
this; if not, the usual one-by-one path is used.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-mq.c         | 17 +++++++++++++++++
 include/linux/blk-mq.h |  8 ++++++++
 2 files changed, 25 insertions(+)