Message ID | 20211117033807.185715-2-axboe@kernel.dk (mailing list archive)
---|---
State | New, archived
Series | Add support for list issue
> @@ -2208,6 +2208,19 @@ static void blk_mq_plug_issue_direct(struct blk_plug *plug, bool from_schedule)
>  	int queued = 0;
>  	int errors = 0;
>  
> +	/*
> +	 * Peek first request and see if we have a ->queue_rqs() hook. If we
> +	 * do, we can dispatch the whole plug list in one go. We already know
> +	 * at this point that all requests belong to the same queue, caller
> +	 * must ensure that's the case.
> +	 */
> +	rq = rq_list_peek(&plug->mq_list);
> +	if (rq->q->mq_ops->queue_rqs) {
> +		rq->q->mq_ops->queue_rqs(&plug->mq_list);
> +		if (rq_list_empty(plug->mq_list))
> +			return;
> +	}

I'd move this straight into the caller to simplify the flow, something
like the patch below. Also I wonder if the slow path might want to
be moved from blk_mq_flush_plug_list into a separate helper to keep
the fast path as straight as possible?

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3ab34c4f20daf..370f4b2ab4511 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2255,6 +2255,14 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 	plug->rq_count = 0;
 
 	if (!plug->multiple_queues && !plug->has_elevator && !from_schedule) {
+		struct request_queue *q = rq_list_peek(&plug->mq_list)->q;
+
+		if (q->mq_ops->queue_rqs) {
+			q->mq_ops->queue_rqs(&plug->mq_list);
+			if (rq_list_empty(plug->mq_list))
+				return;
+		}
+
 		blk_mq_plug_issue_direct(plug, false);
 		if (rq_list_empty(plug->mq_list))
 			return;
On Tue, Nov 16, 2021 at 08:38:04PM -0700, Jens Axboe wrote:
> If we have a list of requests in our plug list, send it to the driver in
> one go, if possible. The driver must set mq_ops->queue_rqs() to support
> this, if not the usual one-by-one path is used.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  block/blk-mq.c         | 17 +++++++++++++++++
>  include/linux/blk-mq.h |  8 ++++++++
>  2 files changed, 25 insertions(+)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 9b4e79e2ac1e..005715206b16 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2208,6 +2208,19 @@ static void blk_mq_plug_issue_direct(struct blk_plug *plug, bool from_schedule)
>  	int queued = 0;
>  	int errors = 0;
>  
> +	/*
> +	 * Peek first request and see if we have a ->queue_rqs() hook. If we
> +	 * do, we can dispatch the whole plug list in one go. We already know
> +	 * at this point that all requests belong to the same queue, caller
> +	 * must ensure that's the case.
> +	 */
> +	rq = rq_list_peek(&plug->mq_list);
> +	if (rq->q->mq_ops->queue_rqs) {
> +		rq->q->mq_ops->queue_rqs(&plug->mq_list);
> +		if (rq_list_empty(plug->mq_list))
> +			return;
> +	}
> +

Then BLK_MQ_F_TAG_QUEUE_SHARED isn't handled as before for multiple NVMe
NS.

thanks,
Ming
On 11/16/21 11:25 PM, Christoph Hellwig wrote:
>> @@ -2208,6 +2208,19 @@ static void blk_mq_plug_issue_direct(struct blk_plug *plug, bool from_schedule)
>>  	int queued = 0;
>>  	int errors = 0;
>>  
>> +	/*
>> +	 * Peek first request and see if we have a ->queue_rqs() hook. If we
>> +	 * do, we can dispatch the whole plug list in one go. We already know
>> +	 * at this point that all requests belong to the same queue, caller
>> +	 * must ensure that's the case.
>> +	 */
>> +	rq = rq_list_peek(&plug->mq_list);
>> +	if (rq->q->mq_ops->queue_rqs) {
>> +		rq->q->mq_ops->queue_rqs(&plug->mq_list);
>> +		if (rq_list_empty(plug->mq_list))
>> +			return;
>> +	}
>
> I'd move this straight into the caller to simplify the flow, something
> like the patch below. Also I wonder if the slow path might want to
> be moved from blk_mq_flush_plug_list into a separate helper to keep
> the fast path as straight as possible?

Yes, I think that's better. I'll fold that in.
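[Editorial sketch: the helper split agreed on above could be shaped like the
following. The helper name blk_mq_plug_issue_slowpath() and its body are
assumptions for illustration, simplified from the quoted code; they are not
part of this series, which also keeps the direct-issue loop on the fallback
path.]

/*
 * Sketch only (hypothetical names): keep the ->queue_rqs() fast path
 * inline in blk_mq_flush_plug_list() and push everything else into a
 * separate helper, so the common case stays branch-free and short.
 */
static void blk_mq_plug_issue_slowpath(struct blk_plug *plug,
				       bool from_schedule)
{
	struct request *rq;

	/* fall back to inserting each remaining request individually */
	while ((rq = rq_list_pop(&plug->mq_list)))
		blk_mq_request_bypass_insert(rq, false,
					     rq_list_empty(plug->mq_list));
}

void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
{
	struct request_queue *q;

	if (rq_list_empty(plug->mq_list))
		return;
	plug->rq_count = 0;

	if (!plug->multiple_queues && !plug->has_elevator && !from_schedule) {
		q = rq_list_peek(&plug->mq_list)->q;
		if (q->mq_ops->queue_rqs) {
			q->mq_ops->queue_rqs(&plug->mq_list);
			if (rq_list_empty(plug->mq_list))
				return;
		}
	}
	blk_mq_plug_issue_slowpath(plug, from_schedule);
}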
On 11/17/21 1:20 AM, Ming Lei wrote:
> On Tue, Nov 16, 2021 at 08:38:04PM -0700, Jens Axboe wrote:
>> If we have a list of requests in our plug list, send it to the driver in
>> one go, if possible. The driver must set mq_ops->queue_rqs() to support
>> this, if not the usual one-by-one path is used.
>>
[...]
>> +	rq = rq_list_peek(&plug->mq_list);
>> +	if (rq->q->mq_ops->queue_rqs) {
>> +		rq->q->mq_ops->queue_rqs(&plug->mq_list);
>> +		if (rq_list_empty(plug->mq_list))
>> +			return;
>> +	}
>> +
>
> Then BLK_MQ_F_TAG_QUEUE_SHARED isn't handled as before for multiple NVMe
> NS.

Can you expand? If we have multiple namespaces in the plug list, we have
multiple queues. There's no direct issue of the list if that's the case.
Or maybe I'm missing what you mean here?
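[Editorial sketch: the guarantee Jens relies on here is that the plug tracks
whether requests from more than one queue were ever added. A simplified
paraphrase of that tracking, modeled on blk_add_rq_to_plug() in
block/blk-mq.c (not code from this patch), is below.]

/*
 * Paraphrased sketch of how the plug learns about multiple queues; the
 * real function also flushes oversized plugs and sets has_elevator.
 */
static void blk_add_rq_to_plug_sketch(struct blk_plug *plug,
				      struct request *rq)
{
	struct request *last = rq_list_peek(&plug->mq_list);

	/* requests for two different queues disable the list fast path */
	if (!plug->multiple_queues && last && last->q != rq->q)
		plug->multiple_queues = true;

	rq_list_add(&plug->mq_list, rq);
	plug->rq_count++;
}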
On Tue, Nov 16, 2021 at 08:38:04PM -0700, Jens Axboe wrote:
> If we have a list of requests in our plug list, send it to the driver in
> one go, if possible. The driver must set mq_ops->queue_rqs() to support
> this, if not the usual one-by-one path is used.

It looks like we still need to sync with the request_queue quiesce flag.
Something like the following (untested) on top of this patch should do
it:

---
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 688ebf6a7a7b..447d0b77375d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -263,6 +263,9 @@ void blk_mq_wait_quiesce_done(struct request_queue *q)
 	unsigned int i;
 	bool rcu = false;
 
+	if (q->tag_set->flags & BLK_MQ_F_BLOCKING)
+		synchronize_srcu(q->srcu);
+
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (hctx->flags & BLK_MQ_F_BLOCKING)
 			synchronize_srcu(hctx->srcu);
@@ -2201,6 +2204,25 @@ static void blk_mq_commit_rqs(struct blk_mq_hw_ctx *hctx, int *queued,
 	*queued = 0;
 }
 
+static void queue_lock(struct request_queue *q, int *srcu_idx)
+	__acquires(q->srcu)
+{
+	if (!(q->tag_set->flags & BLK_MQ_F_BLOCKING)) {
+		*srcu_idx = 0;
+		rcu_read_lock();
+	} else
+		*srcu_idx = srcu_read_lock(q->srcu);
+}
+
+static void queue_unlock(struct request_queue *q, int srcu_idx)
+	__releases(q->srcu)
+{
+	if (!(q->tag_set->flags & BLK_MQ_F_BLOCKING))
+		rcu_read_unlock();
+	else
+		srcu_read_unlock(q->srcu, srcu_idx);
+}
+
 static void blk_mq_plug_issue_direct(struct blk_plug *plug, bool from_schedule)
 {
 	struct blk_mq_hw_ctx *hctx = NULL;
@@ -2216,7 +2238,14 @@ static void blk_mq_plug_issue_direct(struct blk_plug *plug, bool from_schedule)
 	 */
 	rq = rq_list_peek(&plug->mq_list);
 	if (rq->q->mq_ops->queue_rqs) {
-		rq->q->mq_ops->queue_rqs(&plug->mq_list);
+		struct request_queue *q = rq->q;
+		int srcu_idx;
+
+		queue_lock(q, &srcu_idx);
+		if (!blk_queue_quiesced(q))
+			q->mq_ops->queue_rqs(&plug->mq_list);
+		queue_unlock(q, srcu_idx);
+
 		if (rq_list_empty(plug->mq_list))
 			return;
 	}
@@ -3727,6 +3756,8 @@ int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ);
 
 	q->tag_set = set;
+	if (set->flags & BLK_MQ_F_BLOCKING)
+		init_srcu_struct(q->srcu);
 
 	q->queue_flags |= QUEUE_FLAG_MQ_DEFAULT;
 	if (set->nr_maps > HCTX_TYPE_POLL &&
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index bd4370baccca..ae7591dc9cbb 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -373,6 +373,8 @@ struct request_queue {
 	 * devices that do not have multiple independent access ranges.
 	 */
 	struct blk_independent_access_ranges *ia_ranges;
+
+	struct srcu_struct srcu[];
 };
 
 /* Keep blk_queue_flag_name[] in sync with the definitions below */
--
On Wed, Nov 17, 2021 at 08:43:02AM -0700, Jens Axboe wrote:
> On 11/17/21 1:20 AM, Ming Lei wrote:
> > On Tue, Nov 16, 2021 at 08:38:04PM -0700, Jens Axboe wrote:
> >> If we have a list of requests in our plug list, send it to the driver in
> >> one go, if possible. The driver must set mq_ops->queue_rqs() to support
> >> this, if not the usual one-by-one path is used.
[...]
> > Then BLK_MQ_F_TAG_QUEUE_SHARED isn't handled as before for multiple NVMe
> > NS.
>
> Can you expand? If we have multiple namespaces in the plug list, we have
> multiple queues. There's no direct issue of the list if that's the case.
> Or maybe I'm missing what you mean here?

If the plug list only has requests for one namespace, I think Ming is
referring to the special accounting for BLK_MQ_F_TAG_QUEUE_SHARED in
__blk_mq_get_driver_tag() that normally gets called before dispatching
to the driver, but isn't getting called when using .queue_rqs().
On Wed, Nov 17, 2021 at 12:48:14PM -0800, Keith Busch wrote:
> On Wed, Nov 17, 2021 at 08:43:02AM -0700, Jens Axboe wrote:
> > On 11/17/21 1:20 AM, Ming Lei wrote:
[...]
> > > Then BLK_MQ_F_TAG_QUEUE_SHARED isn't handled as before for multiple NVMe
> > > NS.
> >
> > Can you expand? If we have multiple namespaces in the plug list, we have
> > multiple queues. There's no direct issue of the list if that's the case.
> > Or maybe I'm missing what you mean here?
>
> If the plug list only has requests for one namespace, I think Ming is
> referring to the special accounting for BLK_MQ_F_TAG_QUEUE_SHARED in
> __blk_mq_get_driver_tag() that normally gets called before dispatching
> to the driver, but isn't getting called when using .queue_rqs().

Yeah, that is it. This is a normal case: each task runs I/O on a
different namespace, but all NSs share/contend for the single set of
host tags, and BLK_MQ_F_TAG_QUEUE_SHARED is supposed to provide fair
allocation among all these NSs.

thanks,
Ming
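[Editorial sketch: the fairness accounting being bypassed limits each active
queue to a share of the shared tag space. The following is a simplified
paraphrase of the hctx_may_queue() logic reached via
__blk_mq_get_driver_tag(); it is not an exact copy of the kernel source.]

/*
 * Simplified paraphrase of hctx_may_queue(): with
 * BLK_MQ_F_TAG_QUEUE_SHARED, each queue actively submitting I/O
 * (e.g. each NVMe namespace) is limited to a fair fraction of the
 * shared tag depth. Skipping driver-tag allocation in .queue_rqs()
 * would skip this check.
 */
static bool hctx_may_queue_sketch(struct blk_mq_hw_ctx *hctx,
				  unsigned int total_depth)
{
	unsigned int users, fair_depth;

	if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
		return true;

	users = atomic_read(&hctx->tags->active_queues);
	if (!users)
		return true;

	/* give each active queue an equal share, with a small floor */
	fair_depth = max(total_depth / users, 4U);
	return atomic_read(&hctx->nr_active) < fair_depth;
}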
On Wed, Nov 17, 2021 at 12:41:01PM -0800, Keith Busch wrote:
> On Tue, Nov 16, 2021 at 08:38:04PM -0700, Jens Axboe wrote:
> > If we have a list of requests in our plug list, send it to the driver in
> > one go, if possible. The driver must set mq_ops->queue_rqs() to support
> > this, if not the usual one-by-one path is used.
>
> It looks like we still need to sync with the request_queue quiesce flag.

I think this approach is good.

> Something like the following (untested) on top of this patch should do
> it:
>
[...]
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index bd4370baccca..ae7591dc9cbb 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -373,6 +373,8 @@ struct request_queue {
>  	 * devices that do not have multiple independent access ranges.
>  	 */
>  	struct blk_independent_access_ranges *ia_ranges;
> +
> +	struct srcu_struct srcu[];

Basically it is the same as my previous post[1], but the above patch
doesn't handle request queue allocation/freeing correctly in case of
BLK_MQ_F_BLOCKING.

[1] https://lore.kernel.org/linux-block/20211103160018.3764976-1-ming.lei@redhat.com/

Thanks,
Ming
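[Editorial sketch: the allocation concern stems from srcu[] being a flexible
array member of struct request_queue, so the queue allocation must reserve
space for it and tear it down on release. A hypothetical fix might look like
the following; the *_sketch function names and sizing logic are assumptions,
not code from this thread.]

/*
 * Hypothetical sketch of the allocation/teardown the quoted patch is
 * missing: reserve space for the trailing srcu_struct when the tag set
 * is BLK_MQ_F_BLOCKING, and clean it up when the queue is released.
 */
struct request_queue *blk_alloc_queue_sketch(int node, bool blocking)
{
	struct request_queue *q;
	size_t size = sizeof(*q);

	if (blocking)
		size += sizeof(struct srcu_struct);

	q = kzalloc_node(size, GFP_KERNEL, node);
	if (!q)
		return NULL;
	if (blocking && init_srcu_struct(q->srcu)) {
		kfree(q);
		return NULL;
	}
	return q;
}

void blk_release_queue_sketch(struct request_queue *q, bool blocking)
{
	if (blocking)
		cleanup_srcu_struct(q->srcu);
	kfree(q);
}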
On Thu, Nov 18, 2021 at 08:18:29AM +0800, Ming Lei wrote:
> On Wed, Nov 17, 2021 at 12:41:01PM -0800, Keith Busch wrote:
> > On Tue, Nov 16, 2021 at 08:38:04PM -0700, Jens Axboe wrote:
> > > If we have a list of requests in our plug list, send it to the driver in
> > > one go, if possible. The driver must set mq_ops->queue_rqs() to support
> > > this, if not the usual one-by-one path is used.
> >
> > It looks like we still need to sync with the request_queue quiesce flag.
>
> Basically it is the same as my previous post[1], but the above patch doesn't
> handle request queue allocation/freeing correctly in case of BLK_MQ_F_BLOCKING.

Thanks for the pointer. I also thought it may be just as well that blocking
dispatchers don't get to use the optimized rqlist queueing, which would
simplify quiesce to the normal rcu usage.
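[Editorial sketch: Keith's simplification would restrict the list fast path
to non-blocking queues, so a plain RCU read-side section suffices to sync
with quiesce. A sketch of what that check might look like (an assumption,
not code from this series):]

/*
 * Sketch of the suggested restriction: only non-BLK_MQ_F_BLOCKING
 * queues take the ->queue_rqs() fast path, so plain RCU is enough to
 * synchronize against blk_mq_quiesce_queue().
 */
static void blk_mq_plug_issue_rqs_sketch(struct blk_plug *plug)
{
	struct request_queue *q = rq_list_peek(&plug->mq_list)->q;

	if (!q->mq_ops->queue_rqs || (q->tag_set->flags & BLK_MQ_F_BLOCKING))
		return;

	rcu_read_lock();
	if (!blk_queue_quiesced(q))
		q->mq_ops->queue_rqs(&plug->mq_list);
	rcu_read_unlock();
}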
On Wed, Nov 17, 2021 at 07:02:27PM -0700, Keith Busch wrote:
> On Thu, Nov 18, 2021 at 08:18:29AM +0800, Ming Lei wrote:
> > Basically it is the same as my previous post[1], but the above patch doesn't
> > handle request queue allocation/freeing correctly in case of BLK_MQ_F_BLOCKING.
>
> Thanks for the pointer. I also thought it may be just as well that blocking
> dispatchers don't get to use the optimized rqlist queueing,

Yeah, that is fine, but then nvme-tcp may not benefit from the
->queue_rqs() optimization.

> which would simplify quiesce to the normal rcu usage.

The normal dispatch code still needs to check the quiesced flag for
blocking queues, so SRCU is still needed; I don't see an easier way than
SRCU yet. Years ago I tried to replace the SRCU with a percpu refcount,
which is still more complicated than SRCU, though it does let srcu_idx
be dropped.

Thanks,
Ming
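[Editorial sketch: the percpu refcount alternative Ming mentions would
follow the usual kill/wait pattern. The q_dispatch_ref and mq_quiesce_wq
fields below are invented for illustration; this is not Ming's actual
patch.]

/*
 * Hypothetical sketch of quiesce via percpu_ref instead of SRCU.
 * Dispatchers take a reference instead of an srcu_idx; quiesce kills
 * the ref and waits for in-flight dispatchers to drain (the ref's
 * release callback would wake mq_quiesce_wq).
 */
static bool queue_dispatch_enter(struct request_queue *q)
{
	/* fails once quiesce has killed the ref */
	return percpu_ref_tryget_live(&q->q_dispatch_ref);
}

static void queue_dispatch_exit(struct request_queue *q)
{
	percpu_ref_put(&q->q_dispatch_ref);
}

void blk_mq_quiesce_queue_sketch(struct request_queue *q)
{
	percpu_ref_kill(&q->q_dispatch_ref);
	wait_event(q->mq_quiesce_wq, percpu_ref_is_zero(&q->q_dispatch_ref));
}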
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9b4e79e2ac1e..005715206b16 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2208,6 +2208,19 @@ static void blk_mq_plug_issue_direct(struct blk_plug *plug, bool from_schedule)
 	int queued = 0;
 	int errors = 0;
 
+	/*
+	 * Peek first request and see if we have a ->queue_rqs() hook. If we
+	 * do, we can dispatch the whole plug list in one go. We already know
+	 * at this point that all requests belong to the same queue, caller
+	 * must ensure that's the case.
+	 */
+	rq = rq_list_peek(&plug->mq_list);
+	if (rq->q->mq_ops->queue_rqs) {
+		rq->q->mq_ops->queue_rqs(&plug->mq_list);
+		if (rq_list_empty(plug->mq_list))
+			return;
+	}
+
 	while ((rq = rq_list_pop(&plug->mq_list))) {
 		bool last = rq_list_empty(plug->mq_list);
 		blk_status_t ret;
@@ -2256,6 +2269,10 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 
 	if (!plug->multiple_queues && !plug->has_elevator && !from_schedule) {
 		blk_mq_plug_issue_direct(plug, false);
+		/*
+		 * Expected case, all requests got dispatched. If not, fall
+		 * through to individual dispatch of the remainder.
+		 */
 		if (rq_list_empty(plug->mq_list))
 			return;
 	}
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 3ba1e750067b..897cf475e7eb 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -503,6 +503,14 @@ struct blk_mq_ops {
 	 */
 	void (*commit_rqs)(struct blk_mq_hw_ctx *);
 
+	/**
+	 * @queue_rqs: Queue a list of new requests. Driver is guaranteed
+	 * that each request belongs to the same queue. If the driver doesn't
+	 * empty the @rqlist completely, then the rest will be queued
+	 * individually by the block layer upon return.
+	 */
+	void (*queue_rqs)(struct request **rqlist);
+
 	/**
 	 * @get_budget: Reserve budget before queue request, once .queue_rq is
 	 * run, it is driver's responsibility to release the
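[Editorial sketch: to illustrate the contract documented in the new
kerneldoc, a hypothetical driver's ->queue_rqs() might be shaped like the
following. The mydrv_* names are invented; the requeue-list handling shows
how a partially consumed list falls back to one-by-one dispatch.]

/*
 * Hypothetical driver-side ->queue_rqs() sketch. Requests the driver
 * cannot take right now are put back on the list, so the block layer
 * queues them individually on return, per the kerneldoc above.
 */
static void mydrv_queue_rqs(struct request **rqlist)
{
	struct request *rq, *requeue_list = NULL;

	while ((rq = rq_list_pop(rqlist))) {
		bool last = rq_list_empty(*rqlist);

		/* mydrv_submit_cmd() is an assumed driver helper */
		if (!mydrv_submit_cmd(rq, last))
			rq_list_add(&requeue_list, rq);
	}

	/* whatever remains here is re-queued by the block layer */
	*rqlist = requeue_list;
}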
If we have a list of requests in our plug list, send it to the driver in
one go, if possible. The driver must set mq_ops->queue_rqs() to support
this; if not, the usual one-by-one path is used.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-mq.c         | 17 +++++++++++++++++
 include/linux/blk-mq.h |  8 ++++++++
 2 files changed, 25 insertions(+)