Message ID: 20230530094322.258090-3-ming.lei@redhat.com (mailing list archive)
State: New, archived
Series: nvme: add nvme_delete_dead_ctrl for avoiding io deadlock
Thanks Ming, feel free to add:

Tested-by: Yi Zhang <yi.zhang@redhat.com>

On Tue, May 30, 2023 at 5:44 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> Reconnect failure has been reached after trying enough times, and controller
> is actually incapable of handling IO, so it should be marked as dead, so call
> nvme_delete_dead_ctrl() to handle the failure for avoiding the following IO
> deadlock:
>
> 1) writeback IO waits in __bio_queue_enter() because queue is frozen
>    during error recovery
>
> 2) reconnect failure handler removes controller, and del_gendisk() waits
>    for above writeback IO in fsync/invalidate bdev
>
> Fix the issue by calling nvme_delete_dead_ctrl() which call
> nvme_mark_namespaces_dead() before deleting disk, so the above writeback
> IO will be failed, and IO deadlock is avoided.
>
> Reported-by: Yi Zhang <yi.zhang@redhat.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/nvme/host/rdma.c | 2 +-
>  drivers/nvme/host/tcp.c  | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 0eb79696fb73..cdf5855c3009 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -1028,7 +1028,7 @@ static void nvme_rdma_reconnect_or_remove(struct nvme_rdma_ctrl *ctrl)
>  		queue_delayed_work(nvme_wq, &ctrl->reconnect_work,
>  				ctrl->ctrl.opts->reconnect_delay * HZ);
>  	} else {
> -		nvme_delete_ctrl(&ctrl->ctrl);
> +		nvme_delete_dead_ctrl(&ctrl->ctrl);
>  	}
>  }
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index bf0230442d57..2c119bff7010 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -2047,7 +2047,7 @@ static void nvme_tcp_reconnect_or_remove(struct nvme_ctrl *ctrl)
>  				ctrl->opts->reconnect_delay * HZ);
>  	} else {
>  		dev_info(ctrl->device, "Removing controller...\n");
> -		nvme_delete_ctrl(ctrl);
> +		nvme_delete_dead_ctrl(ctrl);
>  	}
>  }
>
> --
> 2.40.1
>
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 0eb79696fb73..cdf5855c3009 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1028,7 +1028,7 @@ static void nvme_rdma_reconnect_or_remove(struct nvme_rdma_ctrl *ctrl)
 		queue_delayed_work(nvme_wq, &ctrl->reconnect_work,
 				ctrl->ctrl.opts->reconnect_delay * HZ);
 	} else {
-		nvme_delete_ctrl(&ctrl->ctrl);
+		nvme_delete_dead_ctrl(&ctrl->ctrl);
 	}
 }
 
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index bf0230442d57..2c119bff7010 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2047,7 +2047,7 @@ static void nvme_tcp_reconnect_or_remove(struct nvme_ctrl *ctrl)
 				ctrl->opts->reconnect_delay * HZ);
 	} else {
 		dev_info(ctrl->device, "Removing controller...\n");
-		nvme_delete_ctrl(ctrl);
+		nvme_delete_dead_ctrl(ctrl);
 	}
 }
Reconnect failure has been reached after trying enough times, and the controller
is actually incapable of handling IO, so it should be marked as dead. Call
nvme_delete_dead_ctrl() to handle the failure and avoid the following IO
deadlock:

1) writeback IO waits in __bio_queue_enter() because the queue is frozen
   during error recovery

2) the reconnect failure handler removes the controller, and del_gendisk()
   waits for the above writeback IO in fsync/invalidate bdev

Fix the issue by calling nvme_delete_dead_ctrl(), which calls
nvme_mark_namespaces_dead() before deleting the disk, so the above writeback
IO fails instead of waiting, and the IO deadlock is avoided.

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/rdma.c | 2 +-
 drivers/nvme/host/tcp.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
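[Editor's note] The nvme_delete_dead_ctrl() helper itself is introduced in an
earlier patch of this series and is not shown here. Based only on the commit
message above, a minimal sketch of what such a helper could look like follows;
the body and ordering are assumptions for illustration, not the series'
verbatim code:

	/*
	 * Hypothetical sketch: mark the namespaces dead before the disks are
	 * torn down, so writeback stuck in __bio_queue_enter() fails instead
	 * of sleeping and del_gendisk() no longer waits on it.
	 */
	static void nvme_delete_dead_ctrl(struct nvme_ctrl *ctrl)
	{
		/* fail blocked and future IO on all namespaces */
		nvme_mark_namespaces_dead(ctrl);

		/* then take the normal controller removal path */
		nvme_delete_ctrl(ctrl);
	}

The ordering is the point of the fix: with the namespaces already marked dead,
the writeback IO in step 1) completes with an error, so the fsync/invalidate
work inside del_gendisk() in step 2) can make progress.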