From patchwork Fri Jul 29 19:57:18 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sagi Grimberg X-Patchwork-Id: 9252797 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id D4C016075F for ; Fri, 29 Jul 2016 19:57:36 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id C5AD8283F3 for ; Fri, 29 Jul 2016 19:57:36 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id BA81228411; Fri, 29 Jul 2016 19:57:36 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.9 required=2.0 tests=BAYES_00,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 583DE283F3 for ; Fri, 29 Jul 2016 19:57:36 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753415AbcG2T5f (ORCPT ); Fri, 29 Jul 2016 15:57:35 -0400 Received: from merlin.infradead.org ([205.233.59.134]:40343 "EHLO merlin.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751558AbcG2T5f (ORCPT ); Fri, 29 Jul 2016 15:57:35 -0400 Received: from 212.116.139.77.rev.sfr.net ([77.139.116.212] helo=bombadil.infradead.org) by merlin.infradead.org with esmtpsa (Exim 4.85_2 #1 (Red Hat Linux)) id 1bTDuU-0000SY-UY; Fri, 29 Jul 2016 19:57:31 +0000 From: Sagi Grimberg To: linux-nvme@lists.infradead.org Cc: linux-rdma@vger.kernel.org, Steve Wise , Christoph Hellwig , Jay Freyensee , Jens Axboe Subject: [PATCH 1/5] nvme-rdma: Fix device removal handling Date: Fri, 29 Jul 2016 22:57:18 +0300 Message-Id: <1469822242-3477-2-git-send-email-sagi@grimberg.me> X-Mailer: git-send-email 1.9.1 In-Reply-To: <1469822242-3477-1-git-send-email-sagi@grimberg.me> References: <1469822242-3477-1-git-send-email-sagi@grimberg.me> Sender: linux-rdma-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-rdma@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Device removal sequence may have crashed because the controller (and admin queue space) was freed before we destroyed the admin queue resources. Thus we want to destroy the admin queue and only then queue controller deletion and wait for it to complete. More specifically we: 1. own the controller deletion (make sure we are not competing with another deletion). 2. get rid of inflight reconnects if exists (which also destroy and create queues). 3. destroy the queue. 4. safely queue controller deletion (and wait for it to complete). Reported-by: Steve Wise Signed-off-by: Sagi Grimberg Reviewed-by: Christoph Hellwig --- drivers/nvme/host/rdma.c | 41 +++++++++++++++++++++-------------------- 1 file changed, 21 insertions(+), 20 deletions(-) diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c index 21ecbf3f3603..f391d67e4ee0 100644 --- a/drivers/nvme/host/rdma.c +++ b/drivers/nvme/host/rdma.c @@ -169,7 +169,6 @@ MODULE_PARM_DESC(register_always, static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id, struct rdma_cm_event *event); static void nvme_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc); -static int __nvme_rdma_del_ctrl(struct nvme_rdma_ctrl *ctrl); /* XXX: really should move to a generic header sooner or later.. */ static inline void put_unaligned_le24(u32 val, u8 *p) @@ -1318,37 +1317,39 @@ out_destroy_queue_ib: * that caught the event. Since we hold the callout until the controller * deletion is completed, we'll deadlock if the controller deletion will * call rdma_destroy_id on this queue's cm_id. Thus, we claim ownership - * of destroying this queue before-hand, destroy the queue resources - * after the controller deletion completed with the exception of destroying - * the cm_id implicitely by returning a non-zero rc to the callout. + * of destroying this queue before-hand, destroy the queue resources, + * then queue the controller deletion which won't destroy this queue and + * we destroy the cm_id implicitely by returning a non-zero rc to the callout. */ static int nvme_rdma_device_unplug(struct nvme_rdma_queue *queue) { struct nvme_rdma_ctrl *ctrl = queue->ctrl; - int ret, ctrl_deleted = 0; + int ret; - /* First disable the queue so ctrl delete won't free it */ - if (!test_and_clear_bit(NVME_RDMA_Q_CONNECTED, &queue->flags)) - goto out; + /* Own the controller deletion */ + if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_DELETING)) + return 0; - /* delete the controller */ - ret = __nvme_rdma_del_ctrl(ctrl); - if (!ret) { - dev_warn(ctrl->ctrl.device, - "Got rdma device removal event, deleting ctrl\n"); - flush_work(&ctrl->delete_work); + dev_warn(ctrl->ctrl.device, + "Got rdma device removal event, deleting ctrl\n"); - /* Return non-zero so the cm_id will destroy implicitly */ - ctrl_deleted = 1; + /* Get rid of reconnect work if its running */ + cancel_delayed_work_sync(&ctrl->reconnect_work); + /* Disable the queue so ctrl delete won't free it */ + if (test_and_clear_bit(NVME_RDMA_Q_CONNECTED, &queue->flags)) { /* Free this queue ourselves */ - rdma_disconnect(queue->cm_id); - ib_drain_qp(queue->qp); + nvme_rdma_stop_queue(queue); nvme_rdma_destroy_queue_ib(queue); + + /* Return non-zero so the cm_id will destroy implicitly */ + ret = 1; } -out: - return ctrl_deleted; + /* Queue controller deletion */ + queue_work(nvme_rdma_wq, &ctrl->delete_work); + flush_work(&ctrl->delete_work); + return ret; } static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,