Message ID: 1487896561-10454-2-git-send-email-keith.busch@intel.com (mailing list archive)
State: New, archived
Thanks Keith,
this looks much simpler so that even I can understand it :)
Reviewed-by: Christoph Hellwig <hch@lst.de>
On 24/02/17 02:36, Keith Busch wrote:
> If the block layer has entered requests and gets a CPU hot plug event
> prior to the resume event, it will wait for those requests to exit. If
> the nvme driver is shutting down, it will not start the queues back up,
> preventing forward progress.
>
> To fix that, this patch freezes the request queues when the driver intends
> to shut down the controller so that no new requests may enter. After the
> controller has been disabled, the queues will be restarted to force all
> entered requests to end in failure so that blk-mq's hot cpu notifier may
> progress. To ensure the queue usage count is 0 on a shutdown, the driver
> waits for freeze to complete before completing the controller shutdown.

Keith, can you explain (again) for me why the freeze_wait must happen
after the controller has been disabled, instead of starting the queues
and waiting right after freeze start?
On Mon, Feb 27, 2017 at 03:46:09PM +0200, Sagi Grimberg wrote:
> On 24/02/17 02:36, Keith Busch wrote:
> > If the block layer has entered requests and gets a CPU hot plug event
> > prior to the resume event, it will wait for those requests to exit. If
> > the nvme driver is shutting down, it will not start the queues back up,
> > preventing forward progress.
> >
> > To fix that, this patch freezes the request queues when the driver intends
> > to shut down the controller so that no new requests may enter. After the
> > controller has been disabled, the queues will be restarted to force all
> > entered requests to end in failure so that blk-mq's hot cpu notifier may
> > progress. To ensure the queue usage count is 0 on a shutdown, the driver
> > waits for freeze to complete before completing the controller shutdown.
>
> Keith, can you explain (again) for me why the freeze_wait must happen
> after the controller has been disabled, instead of starting the queues
> and waiting right after freeze start?

Yeah, the driver needs to make forward progress even if the controller
isn't functioning. If we do the freeze wait before disabling the
controller, there's no way to reclaim missing completions. If the
controller is working perfectly, it'd be okay, but the driver would be
stuck if there's a problem.
>>> If the block layer has entered requests and gets a CPU hot plug event
>>> prior to the resume event, it will wait for those requests to exit. If
>>> the nvme driver is shutting down, it will not start the queues back up,
>>> preventing forward progress.
>>>
>>> To fix that, this patch freezes the request queues when the driver intends
>>> to shut down the controller so that no new requests may enter. After the
>>> controller has been disabled, the queues will be restarted to force all
>>> entered requests to end in failure so that blk-mq's hot cpu notifier may
>>> progress. To ensure the queue usage count is 0 on a shutdown, the driver
>>> waits for freeze to complete before completing the controller shutdown.
>>
>> Keith, can you explain (again) for me why the freeze_wait must happen
>> after the controller has been disabled, instead of starting the queues
>> and waiting right after freeze start?
>
> Yeah, the driver needs to make forward progress even if the controller
> isn't functioning. If we do the freeze wait before disabling the
> controller, there's no way to reclaim missing completions. If the
> controller is working perfectly, it'd be okay, but the driver would be
> stuck if there's a problem.

OK, I think we can get it for fabrics too, need to figure out how to
handle it there too.

Do you have a reproducer?
On Mon, Feb 27, 2017 at 07:27:51PM +0200, Sagi Grimberg wrote:
> OK, I think we can get it for fabrics too, need to figure out how to
> handle it there too.
>
> Do you have a reproducer?

To repro, I have to run a buffered writer workload then put the system
into S3.

This fio job seems to reproduce for me:

  fio --name=global --filename=/dev/nvme0n1 --bsrange=4k-128k --rw=randwrite \
      --ioengine=libaio --iodepth=8 --numjobs=8 --name=foobar

I use rtcwake to test suspend/resume:

  rtcwake -m mem -s 10

Without the patch we'll get stuck after "Disabling non-boot CPUs ..."
when blk-mq waits to freeze some entered queues after nvme was disabled.
On 02/27/2017 08:15 PM, Keith Busch wrote:
> On Mon, Feb 27, 2017 at 07:27:51PM +0200, Sagi Grimberg wrote:
>> OK, I think we can get it for fabrics too, need to figure out how to
>> handle it there too.
>>
>> Do you have a reproducer?
>
> To repro, I have to run a buffered writer workload then put the system
> into S3.
>
> This fio job seems to reproduce for me:
>
>   fio --name=global --filename=/dev/nvme0n1 --bsrange=4k-128k --rw=randwrite \
>       --ioengine=libaio --iodepth=8 --numjobs=8 --name=foobar
>
> I use rtcwake to test suspend/resume:
>
>   rtcwake -m mem -s 10
>
> Without the patch we'll get stuck after "Disabling non-boot CPUs ..."
> when blk-mq waits to freeze some entered queues after nvme was disabled.

I'm observing the same thing when hibernating during mdraid resync on
nvme - it hangs in blk_mq_freeze_queue_wait() after "Disabling non-boot
CPUs ...". This patch did not help, but when I put nvme_wait_freeze()
right after nvme_start_freeze() it appeared to be working. Maybe the
difference here is that requests are submitted from a non-freezable
kernel thread (md sync_thread)?

Thanks,
Artur
>>> OK, I think we can get it for fabrics too, need to figure out how to
>>> handle it there too.
>>>
>>> Do you have a reproducer?
>>
>> To repro, I have to run a buffered writer workload then put the system
>> into S3.
>>
>> This fio job seems to reproduce for me:
>>
>>   fio --name=global --filename=/dev/nvme0n1 --bsrange=4k-128k --rw=randwrite \
>>       --ioengine=libaio --iodepth=8 --numjobs=8 --name=foobar
>>
>> I use rtcwake to test suspend/resume:
>>
>>   rtcwake -m mem -s 10
>>
>> Without the patch we'll get stuck after "Disabling non-boot CPUs ..."
>> when blk-mq waits to freeze some entered queues after nvme was disabled.
>
> I'm observing the same thing when hibernating during mdraid resync on
> nvme - it hangs in blk_mq_freeze_queue_wait() after "Disabling non-boot
> CPUs ...". This patch did not help, but when I put nvme_wait_freeze()
> right after nvme_start_freeze() it appeared to be working.

Interesting. Did the nvme device succeed in shutting down at all?

> Maybe the difference here is that requests are submitted from a
> non-freezable kernel thread (md sync_thread)?

Don't think it's related...
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 25ec4e5..9e99b94 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2344,6 +2344,39 @@ void nvme_kill_queues(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_kill_queues);
 
+void nvme_unfreeze(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	mutex_lock(&ctrl->namespaces_mutex);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		blk_mq_unfreeze_queue(ns->queue);
+	mutex_unlock(&ctrl->namespaces_mutex);
+}
+EXPORT_SYMBOL_GPL(nvme_unfreeze);
+
+void nvme_wait_freeze(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	mutex_lock(&ctrl->namespaces_mutex);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		blk_mq_freeze_queue_wait(ns->queue);
+	mutex_unlock(&ctrl->namespaces_mutex);
+}
+EXPORT_SYMBOL_GPL(nvme_wait_freeze);
+
+void nvme_start_freeze(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	mutex_lock(&ctrl->namespaces_mutex);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		blk_mq_freeze_queue_start(ns->queue);
+	mutex_unlock(&ctrl->namespaces_mutex);
+}
+EXPORT_SYMBOL_GPL(nvme_start_freeze);
+
 void nvme_stop_queues(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index a3da1e9..62af901 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -294,6 +294,9 @@ void nvme_queue_async_events(struct nvme_ctrl *ctrl);
 void nvme_stop_queues(struct nvme_ctrl *ctrl);
 void nvme_start_queues(struct nvme_ctrl *ctrl);
 void nvme_kill_queues(struct nvme_ctrl *ctrl);
+void nvme_unfreeze(struct nvme_ctrl *ctrl);
+void nvme_wait_freeze(struct nvme_ctrl *ctrl);
+void nvme_start_freeze(struct nvme_ctrl *ctrl);
 
 #define NVME_QID_ANY -1
 struct request *nvme_alloc_request(struct request_queue *q,
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 57a1af5..ce80cb5 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1676,6 +1676,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	del_timer_sync(&dev->watchdog_timer);
 
 	mutex_lock(&dev->shutdown_lock);
+	if (shutdown)
+		nvme_start_freeze(&dev->ctrl);
 	if (pci_is_enabled(to_pci_dev(dev->dev))) {
 		nvme_stop_queues(&dev->ctrl);
 		csts = readl(dev->bar + NVME_REG_CSTS);
@@ -1700,6 +1702,16 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 
 	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
 	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
+
+	/*
+	 * The driver will not be starting up queues again if shutting down so
+	 * must flush all entered requests to their failed completion to avoid
+	 * deadlocking blk-mq hot-cpu notifier.
+	 */
+	if (shutdown) {
+		nvme_start_queues(&dev->ctrl);
+		nvme_wait_freeze(&dev->ctrl);
+	}
 	mutex_unlock(&dev->shutdown_lock);
 }
 
@@ -1823,6 +1835,8 @@ static void nvme_reset_work(struct work_struct *work)
 	} else {
 		nvme_start_queues(&dev->ctrl);
 		nvme_dev_add(dev);
+		if (was_suspended)
+			nvme_unfreeze(&dev->ctrl);
 	}
 
 	if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_LIVE)) {
If the block layer has entered requests and gets a CPU hot plug event
prior to the resume event, it will wait for those requests to exit. If
the nvme driver is shutting down, it will not start the queues back up,
preventing forward progress.

To fix that, this patch freezes the request queues when the driver intends
to shut down the controller so that no new requests may enter. After the
controller has been disabled, the queues will be restarted to force all
entered requests to end in failure so that blk-mq's hot cpu notifier may
progress. To ensure the queue usage count is 0 on a shutdown, the driver
waits for freeze to complete before completing the controller shutdown.

On resume, the driver will unfreeze the queue for new requests to enter
once the hardware contexts are reinitialized.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
v1 -> v2: Simplified the freeze and waiting, using the new blk API.

 drivers/nvme/host/core.c | 33 +++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h |  3 +++
 drivers/nvme/host/pci.c  | 14 ++++++++++++++
 3 files changed, 50 insertions(+)