
[5/5] nvme/pci: Remove queue IO flushing hack

Message ID 20190308174006.5032-5-keith.busch@intel.com (mailing list archive)
State New, archived
Series [1/5] blk-mq: Export reading mq request state

Commit Message

Keith Busch March 8, 2019, 5:40 p.m. UTC
The nvme driver checked the queue state on every IO, so this path could
be used to flush entered requests out to a failed completion. The code
even notes in a comment that we shouldn't have to do this, so let's not
do it.

Use blk-mq's tag iterator to end all entered requests when the queue
isn't going to be restarted, so the IO path doesn't have to deal with
these conditions.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/nvme/host/pci.c | 45 +++++++++++++++++++++++++++++----------------
 1 file changed, 29 insertions(+), 16 deletions(-)

Comments

Bart Van Assche March 8, 2019, 6:19 p.m. UTC | #1
On Fri, 2019-03-08 at 10:40 -0700, Keith Busch wrote:
> +static bool nvme_fail_queue_request(struct request *req, void *data, bool reserved)
> +{
> +	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> +	struct nvme_queue *nvmeq = iod->nvmeq;
> +
> +	if (!test_bit(NVMEQ_ENABLED, &nvmeq->flags))
> +		blk_mq_end_request(req, BLK_STS_IOERR);
> +	return true;
> +}

Same question here as for patch 4/5: what prevents concurrent calls of
blk_mq_complete_request() with the blk_mq_end_request() call in the
above function?

Thanks,

Bart.
Christoph Hellwig March 11, 2019, 6:40 p.m. UTC | #2
From a quick look the code seems reasonably sensible here,
but any chance we could have this in common code?

> +static bool nvme_fail_queue_request(struct request *req, void *data, bool reserved)
> +{
> +	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> +	struct nvme_queue *nvmeq = iod->nvmeq;
> +
> +	if (!test_bit(NVMEQ_ENABLED, &nvmeq->flags))
> +		blk_mq_end_request(req, BLK_STS_IOERR);
> +	return true;
> +}

The only thing not purely block layer here is the enabled flag.
So if we had a per-hctx enabled flag we could lift this out of nvme,
and hopefully start reusing it in other drivers.
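
Roughly something like this, maybe (untested sketch; BLK_MQ_S_ENABLED is a
made-up per-hctx state bit that doesn't exist today):

static bool blk_mq_fail_disabled_request(struct blk_mq_hw_ctx *hctx,
		struct request *rq, void *priv, bool reserved)
{
	/* hypothetical generic flag standing in for nvme's NVMEQ_ENABLED */
	if (!test_bit(BLK_MQ_S_ENABLED, &hctx->state))
		blk_mq_end_request(rq, BLK_STS_IOERR);
	return true;
}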
Keith Busch March 11, 2019, 7:37 p.m. UTC | #3
On Mon, Mar 11, 2019 at 07:40:31PM +0100, Christoph Hellwig wrote:
> From a quick look the code seems reasonably sensible here,
> but any chance we could have this in common code?
> 
> > +static bool nvme_fail_queue_request(struct request *req, void *data, bool reserved)
> > +{
> > +	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> > +	struct nvme_queue *nvmeq = iod->nvmeq;
> > +
> > +	if (!test_bit(NVMEQ_ENABLED, &nvmeq->flags))
> > +		blk_mq_end_request(req, BLK_STS_IOERR);
> > +	return true;
> > +}
> 
> The only thing not purely block layer here is the enabled flag.
> So if we had a per-hctx enabled flag we could lift this out of nvme,
> and hopefully start reusing it in other drivers.

Okay, I may even be able to drop the new block exports if we do request
termination in the generic block layer. That's probably the right thing
anyway since that layer is in a better position to check the necessary
conditions that make tag iteration safe. Bart did point out that this is
generally not safe for drivers to do, so it'd be good to safeguard
against incorrect usage.
Christoph Hellwig March 27, 2019, 8:31 a.m. UTC | #4
On Mon, Mar 11, 2019 at 01:37:53PM -0600, Keith Busch wrote:
> > The only thing not purely block layer here is the enabled flag.
> > So if we had a per-hctx enabled flag we could lift this out of nvme,
> > and hopefully start reusing it in other drivers.
> 
> Okay, I may even be able to drop the new block exports if we do request
> termination in the generic block layer. That's probably the right thing
> anyway since that layer is in a better position to check the necessary
> conditions that make tag iteration safe. Bart did point out that this is
> generally not safe for drivers to do, so it'd be good to safeguard
> against incorrect usage.

Did you get a chance to look into this?
Keith Busch March 27, 2019, 1:21 p.m. UTC | #5
On Wed, Mar 27, 2019 at 01:31:42AM -0700, Christoph Hellwig wrote:
> > Okay, I may even be able to drop the new block exports if we do request
> > termination in the generic block layer. That's probably the right thing
> > anyway since that layer is in a better position to check the necessary
> > conditions that make tag iteration safe. Bart did point out that this is
> > generally not safe for drivers to do, so it'd be good to safeguard
> > against incorrect usage.
> 
> Did you get a chance to look into this?

I haven't had a chance to put it through the proper tests (I no longer
have a hotplug machine), but this is what I'd written if you can give it
a quick look:

From 5afd8e3765eabf859100fda84e646a96683d7751 Mon Sep 17 00:00:00 2001
From: Keith Busch <keith.busch@intel.com>
Date: Tue, 12 Mar 2019 13:58:12 -0600
Subject: [PATCH] blk-mq: Provide request termination API

A driver that has determined the hardware contexts backing its request
queue are unable to service new commands would have to end those
commands in its IO path. This requires unlikely checks on every IO.

Create a new API that terminates all requests directly rather than
requiring those requests be flushed through the low level driver. This
is safe only if the current request allocation state is unchanging, so
the driver must have quiesced its queue, initiated a freeze on it, and
reclaimed any in-flight requests so that the tag iterator is not racing
with requests in flux. The new API enforces these conditions before
terminating requests.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 block/blk-mq.c         | 36 ++++++++++++++++++++++++++++++++++++
 include/linux/blk-mq.h |  1 +
 2 files changed, 37 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a9c181603cbd..ad98c27e2b34 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -952,6 +952,42 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	blk_queue_exit(q);
 }
 
+static bool blk_mq_terminate_request(struct blk_mq_hw_ctx *hctx,
+		struct request *rq, void *priv, bool reserved)
+{
+	int *hctx_idx = priv;
+
+	if (WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE))
+		return false;
+	if (hctx->queue_num >= *hctx_idx)
+		blk_mq_end_request(rq, BLK_STS_IOERR);
+	return true;
+}
+
+/**
+ * blk_mq_terminate_queued_requests() - end requests with an error queued on
+ *					hardware contexts at and above the
+ *					provided index.
+ * @q: request queue
+ * @hctx_idx: starting hardware context, 0 for all hctx's
+ *
+ * A low level driver should invoke this routine when its hardware contexts are
+ * not capable of handling future requests. The caller must ensure their
+ * request_queue is quiesced, freeze initiated, and all dispatched requests
+ * have been reclaimed so the tag iteration has a static request allocation to
+ * consider.
+ */
+void blk_mq_terminate_queued_requests(struct request_queue *q, int hctx_idx)
+{
+	if (WARN_ON_ONCE(!atomic_read(&q->mq_freeze_depth)))
+		return;
+	if (WARN_ON_ONCE(!blk_queue_quiesced(q)))
+		return;
+	blk_sync_queue(q);
+	blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_request, &hctx_idx);
+}
+EXPORT_SYMBOL(blk_mq_terminate_queued_requests);
+
 struct flush_busy_ctx_data {
 	struct blk_mq_hw_ctx *hctx;
 	struct list_head *list;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index a64b3fdce0b0..d47cd4575abb 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -334,6 +334,7 @@ int blk_mq_map_queues(struct blk_mq_queue_map *qmap);
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
 void blk_mq_quiesce_queue_nowait(struct request_queue *q);
+void blk_mq_terminate_queued_requests(struct request_queue *q, int hctx_idx);
 
 unsigned int blk_mq_rq_cpu(struct request *rq);
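
For context, a rough sketch of how I'd expect a driver to call this (not
part of the patch; the function and variable names here are made up):

static void example_fail_dead_hctxs(struct request_queue *q, int first_dead_hctx)
{
	blk_freeze_queue_start(q);	/* stop new requests from entering */
	blk_mq_quiesce_queue(q);	/* stop ->queue_rq dispatching */

	/* caller must also have reclaimed anything the hardware still owned */
	blk_mq_terminate_queued_requests(q, first_dead_hctx);
}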
jianchao.wang March 28, 2019, 1:42 a.m. UTC | #6
On 3/27/19 9:21 PM, Keith Busch wrote:
> On Wed, Mar 27, 2019 at 01:31:42AM -0700, Christoph Hellwig wrote:
>>> Okay, I may even be able to drop the new block exports if we do request
>>> termination in the generic block layer. That's probably the right thing
>>> anyway since that layer is in a better position to check the necessary
>>> conditions that make tag iteration safe. Bart did point out that this is
>>> generally not safe for drivers to do, so it'd be good to safeguard
>>> against incorrect usage.
>>
>> Did you get a chance to look into this?
> 
> I haven't had a chance to put it through the proper tests (I no longer
> have a hotplug machine), but this is what I'd written if you can give it
> a quick look:
> 
> From 5afd8e3765eabf859100fda84e646a96683d7751 Mon Sep 17 00:00:00 2001
> From: Keith Busch <keith.busch@intel.com>
> Date: Tue, 12 Mar 2019 13:58:12 -0600
> Subject: [PATCH] blk-mq: Provide request termination API
> 
> A driver that has determined the hardware contexts backing its request
> queue are unable to service new commands would have to end those
> commands in its IO path. This requires unlikely checks on every IO.
> 
> Create a new API that terminates all requests directly rather than
> requiring those requests be flushed through the low level driver. This
> is safe only if the current request allocation state is unchanging, so
> the driver must have quiesced its queue, initiated a freeze on it, and
> reclaimed any in-flight requests so that the tag iterator is not racing
> with requests in flux. The new API enforces these conditions before
> terminating requests.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  block/blk-mq.c         | 36 ++++++++++++++++++++++++++++++++++++
>  include/linux/blk-mq.h |  1 +
>  2 files changed, 37 insertions(+)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index a9c181603cbd..ad98c27e2b34 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -952,6 +952,42 @@ static void blk_mq_timeout_work(struct work_struct *work)
>  	blk_queue_exit(q);
>  }
>  
> +static bool blk_mq_terminate_request(struct blk_mq_hw_ctx *hctx,
> +		struct request *rq, void *priv, bool reserved)
> +{
> +	int *hctx_idx = priv;
> +
> +	if (WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE))
> +		return false;
> +	if (hctx->queue_num >= *hctx_idx)
> +		blk_mq_end_request(rq, BLK_STS_IOERR);
> +	return true;
> +}
> +
> +/**
> + * blk_mq_terminate_queued_requests() - end requests with an error queued on
> + *					hardware contexts at and above the
> + *					provided index.
> + * @q: request queue
> + * @hctx_idx: starting hardware context, 0 for all hctx's
> + *
> + * A low level driver should invoke this routine when its hardware contexts are
> + * not capable of handling future requests. The caller must ensure their
> + * request_queue is quiesced, freeze initiated, and all dispatched requests
> + * have been reclaimed so the tag iteration has a static request allocation to
> + * consider.
> + */
> +void blk_mq_terminate_queued_requests(struct request_queue *q, int hctx_idx)
> +{
> +	if (WARN_ON_ONCE(!atomic_read(&q->mq_freeze_depth)))
> +		return;
> +	if (WARN_ON_ONCE(!blk_queue_quiesced(q)))
> +		return;
> +	blk_sync_queue(q);
> +	blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_request, &hctx_idx);
> +}


Is it really OK to end these requests directly without dequeuing them?
All of them are on ctx->rq_list, the hctx->dispatch list, or the internal queue of the io scheduler.

Terrible things may happen after we unquiesce the queue.

Thanks
Jianchao

> +EXPORT_SYMBOL(blk_mq_terminate_queued_requests);
> +
>  struct flush_busy_ctx_data {
>  	struct blk_mq_hw_ctx *hctx;
>  	struct list_head *list;
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index a64b3fdce0b0..d47cd4575abb 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -334,6 +334,7 @@ int blk_mq_map_queues(struct blk_mq_queue_map *qmap);
>  void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
>  
>  void blk_mq_quiesce_queue_nowait(struct request_queue *q);
> +void blk_mq_terminate_queued_requests(struct request_queue *q, int hctx_idx);
>  
>  unsigned int blk_mq_rq_cpu(struct request *rq);
>  
>
Keith Busch March 28, 2019, 3:33 a.m. UTC | #7
On Thu, Mar 28, 2019 at 09:42:51AM +0800, jianchao.wang wrote:
> On 3/27/19 9:21 PM, Keith Busch wrote:
> > +void blk_mq_terminate_queued_requests(struct request_queue *q, int hctx_idx)
> > +{
> > +	if (WARN_ON_ONCE(!atomic_read(&q->mq_freeze_depth)))
> > +		return;
> > +	if (WARN_ON_ONCE(!blk_queue_quiesced(q)))
> > +		return;
> > +	blk_sync_queue(q);
> > +	blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_request, &hctx_idx);
> > +}
> 
> 
> Is it really OK to end these requests directly w/o dequeue ?
> All of them are on ctx->rq_list, or hctx->dispatch list or internal queue of io scheduler.
> 
> Terrible things may happen after we unquiesce the queue.

Good point. This was intended as a last action before killing an hctx or
the entire request queue, so I didn't expect they'd be turned back on, but
it's easy enough to splice the lists and handle those requests directly. We
wouldn't even need to iterate tags that way, and that may be a better way
to handle this.
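
Something along these lines, perhaps (untested sketch, and it only covers
hctx->dispatch; the sw ctx lists and any scheduler-held requests would need
the same treatment):

static void blk_mq_end_dispatch_requests(struct request_queue *q)
{
	struct blk_mq_hw_ctx *hctx;
	struct request *rq, *next;
	unsigned int i;

	queue_for_each_hw_ctx(q, hctx, i) {
		LIST_HEAD(list);

		/* splice the queued requests off the hctx under its lock */
		spin_lock(&hctx->lock);
		list_splice_init(&hctx->dispatch, &list);
		spin_unlock(&hctx->lock);

		/* end them directly instead of iterating tags */
		list_for_each_entry_safe(rq, next, &list, queuelist) {
			list_del_init(&rq->queuelist);
			blk_mq_end_request(rq, BLK_STS_IOERR);
		}
	}
}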

Patch

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f54718b63637..398c6333cf77 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -918,13 +918,6 @@  static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	struct nvme_command cmnd;
 	blk_status_t ret;
 
-	/*
-	 * We should not need to do this, but we're still using this to
-	 * ensure we can drain requests on a dying queue.
-	 */
-	if (unlikely(!test_bit(NVMEQ_ENABLED, &nvmeq->flags)))
-		return BLK_STS_IOERR;
-
 	ret = nvme_setup_cmd(ns, req, &cmnd);
 	if (ret)
 		return ret;
@@ -1403,10 +1396,6 @@  static int nvme_suspend_queue(struct nvme_queue *nvmeq)
 {
 	if (!test_and_clear_bit(NVMEQ_ENABLED, &nvmeq->flags))
 		return 1;
-
-	/* ensure that nvme_queue_rq() sees NVMEQ_ENABLED cleared */
-	mb();
-
 	nvmeq->dev->online_queues--;
 	if (!nvmeq->qid && nvmeq->dev->ctrl.admin_q)
 		blk_mq_quiesce_queue(nvmeq->dev->ctrl.admin_q);
@@ -1616,15 +1605,29 @@  static const struct blk_mq_ops nvme_mq_ops = {
 	.poll		= nvme_poll,
 };
 
+static bool nvme_fail_queue_request(struct request *req, void *data, bool reserved)
+{
+	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	struct nvme_queue *nvmeq = iod->nvmeq;
+
+	if (!test_bit(NVMEQ_ENABLED, &nvmeq->flags))
+		blk_mq_end_request(req, BLK_STS_IOERR);
+	return true;
+}
+
 static void nvme_dev_remove_admin(struct nvme_dev *dev)
 {
 	if (dev->ctrl.admin_q && !blk_queue_dying(dev->ctrl.admin_q)) {
 		/*
 		 * If the controller was reset during removal, it's possible
-		 * user requests may be waiting on a stopped queue. Start the
-		 * queue to flush these to completion.
+		 * user requests may be waiting on a stopped queue. End all
+		 * entered requests after preventing new requests from
+		 * entering.
 		 */
-		blk_mq_unquiesce_queue(dev->ctrl.admin_q);
+		blk_set_queue_dying(dev->ctrl.admin_q);
+		blk_mq_tagset_busy_iter(&dev->admin_tagset,
+					nvme_fail_queue_request,
+					NULL);
 		blk_cleanup_queue(dev->ctrl.admin_q);
 		blk_mq_free_tag_set(&dev->admin_tagset);
 	}
@@ -2435,6 +2438,14 @@  static void nvme_pci_disable(struct nvme_dev *dev)
 	}
 }
 
+static void nvme_fail_requests(struct nvme_dev *dev)
+{
+	if (!dev->ctrl.tagset)
+		return;
+	blk_mq_tagset_busy_iter(dev->ctrl.tagset, nvme_fail_queue_request,
+				NULL);
+}
+
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 {
 	bool dead = true;
@@ -2475,11 +2486,11 @@  static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 
 	/*
 	 * The driver will not be starting up queues again if shutting down so
-	 * must flush all entered requests to their failed completion to avoid
+	 * must end all entered requests to their failed completion to avoid
 	 * deadlocking blk-mq hot-cpu notifier.
 	 */
 	if (shutdown)
-		nvme_start_queues(&dev->ctrl);
+		nvme_fail_requests(dev);
 	mutex_unlock(&dev->shutdown_lock);
 }
 
@@ -2624,6 +2635,8 @@  static void nvme_reset_work(struct work_struct *work)
 		nvme_remove_namespaces(&dev->ctrl);
 		new_state = NVME_CTRL_ADMIN_ONLY;
 	} else {
+		/* Fail requests that entered an hctx that no longer exists */
+		nvme_fail_requests(dev);
 		nvme_start_queues(&dev->ctrl);
 		nvme_wait_freeze(&dev->ctrl);
 		/* hit this only when allocate tagset fails */