
[09/12] nvme: properly free resources for cancelled command

Message ID 1446885906-20967-10-git-send-email-hch@lst.de (mailing list archive)
State Superseded, archived
Delegated to: Jens Axboe

Commit Message

Christoph Hellwig Nov. 7, 2015, 8:45 a.m. UTC
We need to move freeing of resources to the ->complete handler to ensure
they are also freed when we cancel the command.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/pci.c | 86 +++++++++++++++++++++++++++----------------------
 1 file changed, 47 insertions(+), 39 deletions(-)
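
For readers skimming before the full diff below: blk-mq completes requests in two stages, and the patch leans on that to get a single teardown point. A condensed sketch of the resulting shape, using identifiers from the diff (retry handling and status translation elided):

/*
 * Stage 1 -- interrupt/CQ path: record the NVMe status and hand the
 * request to blk-mq.  No per-command resources are freed here.
 */
blk_mq_complete_request(req, le16_to_cpu(cqe.status) >> 1);

/*
 * The cancellation path used during controller teardown also ends in
 * blk_mq_complete_request(), just with an error code, so a cancelled
 * command reaches the same handler as a normally completed one.
 */

/*
 * Stage 2 -- the ->complete callback wired up in blk_mq_ops: the one
 * place where the iod's DMA mappings are released before the request
 * is ended.
 */
static void nvme_complete_rq(struct request *req)
{
	struct nvme_cmd_info *cmd = blk_mq_rq_to_pdu(req);
	int error = 0;		/* see the full handler below for error handling */

	nvme_unmap_data(cmd->nvmeq->dev, cmd->iod);
	blk_mq_end_request(req, error);
}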

Comments

Keith Busch Nov. 9, 2015, 6:57 p.m. UTC | #1
On Sat, Nov 07, 2015 at 09:45:03AM +0100, Christoph Hellwig wrote:
> +	if (unlikely(req->errors)) {
> +		/*
> +		 * Some silly Intel userspace code breaks if it doesn't get a
> +		 * negative errno back for driver returns values.
> +		 */

Whoa now, it's neither Intel nor userspace that needs this. It's to know
if the controller is unresponsive or returned an error. The difference
matters to the driver for initialization.

> +		if (req->errors < 0) {
> +			error = req->errors;
> +		} else {
> +			if (nvme_req_needs_retry(req, req->errors)) {
> +				nvme_requeue_req(req);
> +				return;
> +			}
> +
> +			if (req->cmd_type == REQ_TYPE_DRV_PRIV)
> +				error = req->errors;
> +			else
> +				error = nvme_error_status(req->errors);
> +		}
Christoph Hellwig Nov. 9, 2015, 7:25 p.m. UTC | #2
On Mon, Nov 09, 2015 at 06:57:31PM +0000, Keith Busch wrote:
> On Sat, Nov 07, 2015 at 09:45:03AM +0100, Christoph Hellwig wrote:
> > +	if (unlikely(req->errors)) {
> > +		/*
> > +		 * Some silly Intel userspace code breaks if it doesn't get a
> > +		 * negative errno back for driver returns values.
> > +		 */
> 
> Whoa now, it's neither Intel nor userspace that needs this. It's to know
> if the controller is unresponsive or returned an error. The difference
> matters to the driver for initialization.

Haha, so we at least can root cause this now.  Can you point me
to the caller that cares?  I'd really like to get rid of the special
case of passing a negative errno here, so I'd like to figure out how
we could pass this information on instead.
Keith Busch Nov. 9, 2015, 8:12 p.m. UTC | #3
On Mon, Nov 09, 2015 at 08:25:19PM +0100, Christoph Hellwig wrote:
> On Mon, Nov 09, 2015 at 06:57:31PM +0000, Keith Busch wrote:
> > Whoa now, it's neither Intel nor userspace that needs this. It's to know
> > if the controller is unresponsive or returned an error. The difference
> > matters to the driver for initialization.
> 
> Haha, so we at least can root cause this now.  Can you point me
> to the caller that cares?  I'd really like to get rid of the special
> case of passing a negative errno here, so I'd like to figure out how
> we could pass this information on instead.

The "set_queue_count()" was the function that cared, but looks like
patch 8/12 makes this think a timeout is an aborted status.
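
For context, the distinction lives in set_queue_count(): a negative value means the Set Features command never completed at all (controller unresponsive or submission failure), which is fatal to initialization, while a positive NVMe status only means the controller refused the feature and the driver can fall back to fewer queues. Roughly, paraphrased from the driver of that era rather than quoted verbatim:

static int set_queue_count(struct nvme_dev *dev, int count)
{
	int status;
	u32 result;
	u32 q_count = (count - 1) | ((count - 1) << 16);

	status = nvme_set_features(dev, NVME_FEAT_NUM_QUEUES, q_count, 0,
								&result);
	if (status < 0)
		return status;		/* no answer at all: give up */
	if (status > 0) {
		/* controller returned an NVMe error: not fatal */
		dev_err(dev->dev, "Could not set queue count (%d)\n", status);
		return 0;
	}
	return min(result & 0xffff, result >> 16) + 1;
}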
Christoph Hellwig Nov. 10, 2015, 8:13 a.m. UTC | #4
On Mon, Nov 09, 2015 at 08:12:33PM +0000, Keith Busch wrote:
> > Haha, so we at least can root cause this now.  Can you point me
> > to the caller that cares?  I'd really like to get rid of the special
> > case of passing a negative errno here, so I'd like to figure out how
> > we could pass this information on instead.
> 
> The "set_queue_count()" was the function that cared, but looks like
> patch 8/12 makes this think a timeout is an aborted status.

I'm still a bit confused about the semantics you want/need.  If the
Set Features for set_queue_count times out we'll call the reset handler,
which because we are inside the probe handler will remove the device.
Why do we care about the return value in that case?

Can you write down a few sentences on why/how we care?  I'll volunteer
to put them into the driver in comment form once we have all this sorted
out so that anyone touching the driver in the future won't be as confused.
Keith Busch Nov. 10, 2015, 4:03 p.m. UTC | #5
On Tue, Nov 10, 2015 at 09:13:57AM +0100, Christoph Hellwig wrote:
> Set Features for set_queue_count times out we'll call the reset handler,
> which because we are inside the probe handler will remove the device.
> Why do we care about the return value in that case?
> 
> Can you write down a few sentences on why/how we care?  I'll volunteer
> to put them into the driver in comment form once we have all this sorted
> out so that anyone touching the driver in the future won't be as confused.

Perhaps I am thinking of how probing worked serially before, and don't
understand how this works anymore. :)

You're right, we don't really care anymore if the reset handler unwinds
it. This path is then safe to see a fake error code.

But the reset handler is the same "work" as probe now, so it won't get
scheduled. Now I completely understand why we changed nvme_timeout()
to end the request with -EIO instead of waiting for the reset work to
cancel it. That's still unsafe since it frees the command for reuse
while the ID is still technically owned by the controller.
Keith Busch Nov. 10, 2015, 8:28 p.m. UTC | #6
On Tue, Nov 10, 2015 at 04:03:45PM +0000, Keith Busch wrote:
> On Tue, Nov 10, 2015 at 09:13:57AM +0100, Christoph Hellwig wrote:
> > Set Features for set_queue_count times out we'll call the reset handler,
> > which because we are inside the probe handler will remove the device.
> > Why do we care about the return value in that case?
> > 
> > Can you write down a few sentences on why/how we care?  I'll volunteer
> > to put them into the driver in comment form once we have all this sorted
> > out so that anyone touching the driver in the future won't be as confused.
> 
> Perhaps I am thinking of how probing worked serially before, and don't
> understand how this works anymore. :)
> 
> You're right, we don't really care anymore if the reset handler unwinds
> it. This path is then safe to see a fake error code.

Actually this still needs to be a negative error so nvme_reset_work
doesn't clear the "NVME_CTRL_RESETTING" bit. Without that, the driver gets
stuck in an infinite reset loop.

 
> But the reset handler is the same "work" as probe now, so it won't get
> scheduled. Now I completely understand why we changed nvme_timeout()
> to end the request with -EIO instead of waiting for the reset work to
> cancel it. That's still unsafe since it frees the command for reuse
> while the ID is still technically owned by the controller.
Christoph Hellwig Nov. 16, 2015, 10:05 a.m. UTC | #7
On Tue, Nov 10, 2015 at 08:28:11PM +0000, Keith Busch wrote:
> > Perhaps I am thinking of how probing worked serially before, and don't
> > understand how this works anymore. :)
> > 
> > You're right, we don't really care anymore if the reset handler unwinds
> > it. This path is then safe to see a fake error code.
> 
> Actually this still needs to be a negative error so nvme_reset_work
> doesn't clear the "NVME_CTRL_RESETTING" bit. Without that, the driver gets
> stuck in an infinite reset loop.

I don't think this is going to help as we'll never enter the reset loop
now that we have a single reset_work item, oops.

I guess we'll just need to set the aborted status from nvme_timeout
if we are in the reset handler already so that it can handle the
errors directly instead of waiting for another reset_work to be queued
up. I'll try to come up with a version of that.
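
A rough sketch of what that could look like in nvme_timeout() (illustrative only, not the actual follow-up patch; it assumes the NVME_CTRL_RESETTING flag discussed above is a bit in dev->flags):

static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
{
	struct nvme_cmd_info *cmd = blk_mq_rq_to_pdu(req);
	struct nvme_dev *dev = cmd->nvmeq->dev;

	/*
	 * If we are already inside the reset handler there is no further
	 * reset_work to queue, so mark the command as aborted and complete
	 * it here; the init code then sees the error and unwinds itself.
	 * (NVME_CTRL_RESETTING as discussed above -- assumed layout.)
	 */
	if (test_bit(NVME_CTRL_RESETTING, &dev->flags)) {
		req->errors = NVME_SC_ABORT_REQ;
		return BLK_EH_HANDLED;
	}

	/*
	 * Otherwise escalate to a controller reset and let reset_work
	 * cancel and complete the command, as before.
	 */
	nvme_reset(dev);
	return BLK_EH_RESET_TIMER;
}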

Patch

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 03b8a3c..a5ee159 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -76,12 +76,10 @@  static wait_queue_head_t nvme_kthread_wait;
 
 struct nvme_dev;
 struct nvme_queue;
-struct nvme_iod;
 
 static int nvme_reset(struct nvme_dev *dev);
 static int nvme_process_cq(struct nvme_queue *nvmeq);
 static void nvme_remove_dead_ctrl(struct nvme_dev *dev);
-static void nvme_unmap_data(struct nvme_dev *dev, struct nvme_iod *iod);
 
 struct async_cmd_info {
 	struct kthread_work work;
@@ -483,42 +481,6 @@  static void nvme_dif_complete(u32 p, u32 v, struct t10_pi_tuple *pi)
 }
 #endif
 
-static void req_completion(struct nvme_queue *nvmeq, struct nvme_completion *cqe)
-{
-	struct request *req = blk_mq_tag_to_rq(*nvmeq->tags, cqe->command_id);
-	struct nvme_cmd_info *cmd_rq = blk_mq_rq_to_pdu(req);
-	struct nvme_iod *iod = cmd_rq->iod;
-	u16 status = le16_to_cpup(&cqe->status) >> 1;
-	int error = 0;
-
-	if (unlikely(status)) {
-		if (nvme_req_needs_retry(req, status)) {
-			nvme_unmap_data(nvmeq->dev, iod);
-			nvme_requeue_req(req);
-			return;
-		}
-
-		if (req->cmd_type == REQ_TYPE_DRV_PRIV) {
-			error = status;
-		} else {
-			error = nvme_error_status(status);
-		}
-	}
-
-	if (req->cmd_type == REQ_TYPE_DRV_PRIV) {
-		u32 result = le32_to_cpup(&cqe->result);
-		req->special = (void *)(uintptr_t)result;
-	}
-
-	if (cmd_rq->aborted)
-		dev_warn(nvmeq->dev->dev,
-			"completing aborted command with status:%04x\n",
-			error);
-
-	nvme_unmap_data(nvmeq->dev, iod);
-	blk_mq_complete_request(req, error);
-}
-
 static bool nvme_setup_prps(struct nvme_dev *dev, struct nvme_iod *iod,
 		int total_len)
 {
@@ -760,6 +722,43 @@  out:
 	return ret;
 }
 
+static void nvme_complete_rq(struct request *req)
+{
+	struct nvme_cmd_info *cmd = blk_mq_rq_to_pdu(req);
+	struct nvme_dev *dev = cmd->nvmeq->dev;
+	int error = 0;
+
+	nvme_unmap_data(dev, cmd->iod);
+
+	if (unlikely(req->errors)) {
+		/*
+		 * Some silly Intel userspace code breaks if it doesn't get a
+		 * negative errno back for driver returns values.
+		 */
+		if (req->errors < 0) {
+			error = req->errors;
+		} else {
+			if (nvme_req_needs_retry(req, req->errors)) {
+				nvme_requeue_req(req);
+				return;
+			}
+
+			if (req->cmd_type == REQ_TYPE_DRV_PRIV)
+				error = req->errors;
+			else
+				error = nvme_error_status(req->errors);
+		}
+	}
+
+	if (cmd->aborted) {
+		dev_warn(dev->dev,
+			"completing aborted command with status:%04x\n",
+			req->errors);
+	}
+
+	blk_mq_end_request(req, error);
+}
+
 static int nvme_process_cq(struct nvme_queue *nvmeq)
 {
 	u16 head, phase;
@@ -770,6 +769,7 @@  static int nvme_process_cq(struct nvme_queue *nvmeq)
 	for (;;) {
 		struct nvme_completion cqe = nvmeq->cqes[head];
 		u16 status = le16_to_cpu(cqe.status);
+		struct request *req;
 
 		if ((status & 1) != phase)
 			break;
@@ -798,7 +798,13 @@  static int nvme_process_cq(struct nvme_queue *nvmeq)
 			continue;
 		}
 
-		req_completion(nvmeq, &cqe);
+		req = blk_mq_tag_to_rq(*nvmeq->tags, cqe.command_id);
+		if (req->cmd_type == REQ_TYPE_DRV_PRIV) {
+			u32 result = le32_to_cpu(cqe.result);
+			req->special = (void *)(uintptr_t)result;
+		}
+		blk_mq_complete_request(req, status >> 1);
+
 	}
 
 	/* If the controller ignores the cq head doorbell and continuously
@@ -1297,6 +1303,7 @@  static int nvme_shutdown_ctrl(struct nvme_dev *dev)
 
 static struct blk_mq_ops nvme_mq_admin_ops = {
 	.queue_rq	= nvme_queue_rq,
+	.complete	= nvme_complete_rq,
 	.map_queue	= blk_mq_map_queue,
 	.init_hctx	= nvme_admin_init_hctx,
 	.exit_hctx      = nvme_admin_exit_hctx,
@@ -1306,6 +1313,7 @@  static struct blk_mq_ops nvme_mq_admin_ops = {
 
 static struct blk_mq_ops nvme_mq_ops = {
 	.queue_rq	= nvme_queue_rq,
+	.complete	= nvme_complete_rq,
 	.map_queue	= blk_mq_map_queue,
 	.init_hctx	= nvme_init_hctx,
 	.init_request	= nvme_init_request,