
[PATCHv2,2/2] nvme: Complete all stuck requests

Message ID 1487896561-10454-2-git-send-email-keith.busch@intel.com (mailing list archive)
State New, archived

Commit Message

Keith Busch Feb. 24, 2017, 12:36 a.m. UTC
If the block layer has entered requests and gets a CPU hot plug event
prior to the resume event, it will wait for those requests to exit. If
the nvme driver is shutting down, it will not start the queues back up,
preventing forward progress.

To fix that, this patch freezes the request queues when the driver intends
to shut down the controller so that no new requests may enter.  After the
controller has been disabled, the queues will be restarted to force all
entered requests to end in failure so that blk-mq's hot cpu notifier may
progress. To ensure the queue usage count is 0 on a shutdown, the driver
waits for freeze to complete before completing the controller shutdown.

On resume, the driver will unfreeze the queue for new requests to enter
once the hardware contexts are reinitialized.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
v1 -> v2:
  Simplified the freeze and waiting, using the new blk API.

 drivers/nvme/host/core.c | 33 +++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h |  3 +++
 drivers/nvme/host/pci.c  | 14 ++++++++++++++
 3 files changed, 50 insertions(+)
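
For reference, the call ordering the diff at the bottom of this page adds, condensed into a sketch (not the literal driver code; locking and error handling omitted):

  /* Shutdown path: nvme_dev_disable(dev, shutdown == true) */
  nvme_start_freeze(&dev->ctrl);    /* no new requests may enter the queues */
  /* ... stop the queues, disable the controller, cancel outstanding requests ... */
  nvme_start_queues(&dev->ctrl);    /* flush entered requests to their (failed) completion */
  nvme_wait_freeze(&dev->ctrl);     /* queue usage counts reach zero */

  /* Resume path: nvme_reset_work() after a suspend */
  nvme_start_queues(&dev->ctrl);
  nvme_dev_add(dev);
  if (was_suspended)
          nvme_unfreeze(&dev->ctrl); /* new requests may enter again */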

Comments

Christoph Hellwig Feb. 24, 2017, 6:50 a.m. UTC | #1
Thanks Keith,

this looks much simpler so that even I can understand it :)

Reviewed-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg Feb. 27, 2017, 1:46 p.m. UTC | #2
On 24/02/17 02:36, Keith Busch wrote:
> If the block layer has entered requests and gets a CPU hot plug event
> prior to the resume event, it will wait for those requests to exit. If
> the nvme driver is shutting down, it will not start the queues back up,
> preventing forward progress.
>
> To fix that, this patch freezes the request queues when the driver intends
> to shut down the controller so that no new requests may enter.  After the
> controller has been disabled, the queues will be restarted to force all
> entered requests to end in failure so that blk-mq's hot cpu notifier may
> progress. To ensure the queue usage count is 0 on a shutdown, the driver
> waits for freeze to complete before completing the controller shutdown.

Keith, can you explain (again) why the freeze_wait must happen
after the controller has been disabled, instead of starting the queues
and waiting right after the freeze starts?
Keith Busch Feb. 27, 2017, 3:01 p.m. UTC | #3
On Mon, Feb 27, 2017 at 03:46:09PM +0200, Sagi Grimberg wrote:
> On 24/02/17 02:36, Keith Busch wrote:
> > If the block layer has entered requests and gets a CPU hot plug event
> > prior to the resume event, it will wait for those requests to exit. If
> > the nvme driver is shutting down, it will not start the queues back up,
> > preventing forward progress.
> > 
> > To fix that, this patch freezes the request queues when the driver intends
> > to shut down the controller so that no new requests may enter.  After the
> > controller has been disabled, the queues will be restarted to force all
> > entered requests to end in failure so that blk-mq's hot cpu notifier may
> > progress. To ensure the queue usage count is 0 on a shutdown, the driver
> > waits for freeze to complete before completing the controller shutdown.
> 
> Keith, can you explain (again) why the freeze_wait must happen
> after the controller has been disabled, instead of starting the queues
> and waiting right after the freeze starts?

Yeah, the driver needs to make forward progress even if the controller
isn't functioning. If we do the freeze wait before disabling the
controller, there's no way to reclaim missing completions. If the
controller is working perfectly, it'd be okay, but the driver would be
stuck if there's a problem.
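
For what it's worth, blk_mq_freeze_queue_wait() is roughly the following (a sketch of the blk-mq helper as I understand it, not quoted from a specific tree):

  void blk_mq_freeze_queue_wait(struct request_queue *q)
  {
          /* Sleep until every request that has entered the queue completes. */
          wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
  }

With a broken controller those completions never arrive on their own; they are only reclaimed by the cancel/restart steps in nvme_dev_disable(), which is why the wait has to sit after the disable rather than before it.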
Sagi Grimberg Feb. 27, 2017, 5:27 p.m. UTC | #4
>>> If the block layer has entered requests and gets a CPU hot plug event
>>> prior to the resume event, it will wait for those requests to exit. If
>>> the nvme driver is shutting down, it will not start the queues back up,
>>> preventing forward progress.
>>>
>>> To fix that, this patch freezes the request queues when the driver intends
>>> to shut down the controller so that no new requests may enter.  After the
>>> controller has been disabled, the queues will be restarted to force all
>>> entered requests to end in failure so that blk-mq's hot cpu notifier may
>>> progress. To ensure the queue usage count is 0 on a shutdown, the driver
>>> waits for freeze to complete before completing the controller shutdown.
>>
>> Keith, can you explain (again) why the freeze_wait must happen
>> after the controller has been disabled, instead of starting the queues
>> and waiting right after the freeze starts?
>
> Yeah, the driver needs to make forward progress even if the controller
> isn't functioning. If we do the freeze wait before disabling the
> controller, there's no way to reclaim missing completions. If the
> controller is working perfectly, it'd be okay, but the driver would be
> stuck if there's a problem.

OK, I think we can get it for fabrics too, need to figure out how to
handle it there too.

Do you have a reproducer?
Keith Busch Feb. 27, 2017, 7:15 p.m. UTC | #5
On Mon, Feb 27, 2017 at 07:27:51PM +0200, Sagi Grimberg wrote:
> OK, I think we can get it for fabrics too, need to figure out how to
> handle it there too.
> 
> Do you have a reproducer?

To repro, I have to run a buffered writer workload then put the system into S3.

This fio job seems to reproduce for me:

  fio --name=global --filename=/dev/nvme0n1 --bsrange=4k-128k --rw=randwrite --ioengine=libaio --iodepth=8 --numjobs=8 --name=foobar

I use rtcwake to test suspend/resume:

  rtcwake -m mem -s 10

Without the patch we'll get stuck after "Disabling non-boot CPUs ..."
when blk-mq waits to freeze some entered queues after nvme was disabled.
Artur Paszkiewicz Feb. 28, 2017, 7:42 a.m. UTC | #6
On 02/27/2017 08:15 PM, Keith Busch wrote:
> On Mon, Feb 27, 2017 at 07:27:51PM +0200, Sagi Grimberg wrote:
>> OK, I think we can get it for fabrics too, need to figure out how to
>> handle it there too.
>>
>> Do you have a reproducer?
> 
> To repro, I have to run a buffered writer workload then put the system into S3.
> 
> This fio job seems to reproduce for me:
> 
>   fio --name=global --filename=/dev/nvme0n1 --bsrange=4k-128k --rw=randwrite --ioengine=libaio --iodepth=8 --numjobs=8 --name=foobar
> 
> I use rtcwake to test suspend/resume:
> 
>   rtcwake -m mem -s 10
> 
> Without the patch we'll get stuck after "Disabling non-boot CPUs ..."
> when blk-mq waits to freeze some entered queues after nvme was disabled.

I'm observing the same thing when hibernating during mdraid resync on
nvme - it hangs in blk_mq_freeze_queue_wait() after "Disabling non-boot
CPUs ...". This patch did not help but when I put nvme_wait_freeze()
right after nvme_start_freeze() it appeared to be working. Maybe the
difference here is that requests are submitted from a non-freezable
kernel thread (md sync_thread)?

Thanks,
Artur
Sagi Grimberg Feb. 28, 2017, 11:45 a.m. UTC | #7
>>> OK, I think we can get it for fabrics too, need to figure out how to
>>> handle it there too.
>>>
>>> Do you have a reproducer?
>>
>> To repro, I have to run a buffered writer workload then put the system into S3.
>>
>> This fio job seems to reproduce for me:
>>
>>   fio --name=global --filename=/dev/nvme0n1 --bsrange=4k-128k --rw=randwrite --ioengine=libaio --iodepth=8 --numjobs=8 --name=foobar
>>
>> I use rtcwake to test suspend/resume:
>>
>>   rtcwake -m mem -s 10
>>
>> Without the patch we'll get stuck after "Disabling non-boot CPUs ..."
>> when blk-mq waits to freeze some entered queues after nvme was disabled.
>
> I'm observing the same thing when hibernating during mdraid resync on
> nvme - it hangs in blk_mq_freeze_queue_wait() after "Disabling non-boot
> CPUs ...". This patch did not help but when I put nvme_wait_freeze()
> right after nvme_start_freeze() it appeared to be working.

Interesting. Did the nvme device succeed in shutting down at all?

> Maybe the
> difference here is that requests are submitted from a non-freezable
> kernel thread (md sync_thread)?

Don't think it's related...

Patch

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 25ec4e5..9e99b94 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2344,6 +2344,39 @@  void nvme_kill_queues(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_kill_queues);
 
+void nvme_unfreeze(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	mutex_lock(&ctrl->namespaces_mutex);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		blk_mq_unfreeze_queue(ns->queue);
+	mutex_unlock(&ctrl->namespaces_mutex);
+}
+EXPORT_SYMBOL_GPL(nvme_unfreeze);
+
+void nvme_wait_freeze(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	mutex_lock(&ctrl->namespaces_mutex);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		blk_mq_freeze_queue_wait(ns->queue);
+	mutex_unlock(&ctrl->namespaces_mutex);
+}
+EXPORT_SYMBOL_GPL(nvme_wait_freeze);
+
+void nvme_start_freeze(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	mutex_lock(&ctrl->namespaces_mutex);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		blk_mq_freeze_queue_start(ns->queue);
+	mutex_unlock(&ctrl->namespaces_mutex);
+}
+EXPORT_SYMBOL_GPL(nvme_start_freeze);
+
 void nvme_stop_queues(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index a3da1e9..62af901 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -294,6 +294,9 @@  void nvme_queue_async_events(struct nvme_ctrl *ctrl);
 void nvme_stop_queues(struct nvme_ctrl *ctrl);
 void nvme_start_queues(struct nvme_ctrl *ctrl);
 void nvme_kill_queues(struct nvme_ctrl *ctrl);
+void nvme_unfreeze(struct nvme_ctrl *ctrl);
+void nvme_wait_freeze(struct nvme_ctrl *ctrl);
+void nvme_start_freeze(struct nvme_ctrl *ctrl);
 
 #define NVME_QID_ANY -1
 struct request *nvme_alloc_request(struct request_queue *q,
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 57a1af5..ce80cb5 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1676,6 +1676,8 @@  static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	del_timer_sync(&dev->watchdog_timer);
 
 	mutex_lock(&dev->shutdown_lock);
+	if (shutdown)
+		nvme_start_freeze(&dev->ctrl);
 	if (pci_is_enabled(to_pci_dev(dev->dev))) {
 		nvme_stop_queues(&dev->ctrl);
 		csts = readl(dev->bar + NVME_REG_CSTS);
@@ -1700,6 +1702,16 @@  static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 
 	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
 	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
+
+	/*
+	 * The driver will not be starting up queues again if shutting down so
+	 * must flush all entered requests to their failed completion to avoid
+	 * deadlocking blk-mq hot-cpu notifier.
+	 */
+	if (shutdown) {
+		nvme_start_queues(&dev->ctrl);
+		nvme_wait_freeze(&dev->ctrl);
+	}
 	mutex_unlock(&dev->shutdown_lock);
 }
 
@@ -1823,6 +1835,8 @@  static void nvme_reset_work(struct work_struct *work)
 	} else {
 		nvme_start_queues(&dev->ctrl);
 		nvme_dev_add(dev);
+		if (was_suspended)
+			nvme_unfreeze(&dev->ctrl);
 	}
 
 	if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_LIVE)) {