From patchwork Mon May 8 16:15:25 2017
X-Patchwork-Submitter: Ming Lei <ming.lei@redhat.com>
X-Patchwork-Id: 9716439
Date: Tue, 9 May 2017 00:15:25 +0800
From: Ming Lei
To: Keith Busch
Cc: Jens Axboe, Sagi Grimberg, stable@vger.kernel.org,
 linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
 Christoph Hellwig
Subject: Re: [PATCH] nvme: remove disk after hw queue is started
Message-ID: <20170508161524.GE5696@ming.t460p>
References: <20170508112457.10236-1-ming.lei@redhat.com>
 <20170508124638.GD5696@ming.t460p>
 <20170508150720.GB32736@localhost.localdomain>
 <20170508151123.GA463@localhost.localdomain>
In-Reply-To: <20170508151123.GA463@localhost.localdomain>
User-Agent: Mutt/1.8.0 (2017-02-23)
X-Mailing-List: linux-block@vger.kernel.org

On Mon, May 08, 2017 at 11:11:24AM -0400, Keith Busch wrote:
> On Mon, May 08, 2017 at 11:07:20AM -0400, Keith Busch wrote:
> > I'm almost certain the remove_work shouldn't even be running in this
> > case. If the reset work can't transition the controller state correctly,
> > it should assume something is handling the controller.
>
> Here's the more complete version of what I had in mind. Does this solve
> the reported issue?
>
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 26a5fd0..46a37fb 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1792,7 +1792,7 @@ static void nvme_reset_work(struct work_struct *work)
>  		nvme_dev_disable(dev, false);
>
>  	if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING))
> -		goto out;
> +		return;
>
>  	result = nvme_pci_enable(dev);
>  	if (result)
> @@ -1854,7 +1854,7 @@ static void nvme_reset_work(struct work_struct *work)
>
>  	if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_LIVE)) {
>  		dev_warn(dev->ctrl.device, "failed to mark controller live\n");
> -		goto out;
> +		return;
>  	}
>
>  	if (dev->online_queues > 1)

This patch looks like it works, but it seems any 'goto out' in this
function risks causing the same race too.

Another solution I thought of is to kill the queues earlier; what do you
think about the following patch?

---

Thanks,
Ming

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index c8541c3dcd19..16740e8c4225 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1892,6 +1892,7 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)

 	kref_get(&dev->ctrl.kref);
 	nvme_dev_disable(dev, false);
+	nvme_kill_queues(&dev->ctrl);
 	if (!schedule_work(&dev->remove_work))
 		nvme_put_ctrl(&dev->ctrl);
 }
@@ -1998,7 +1999,6 @@ static void nvme_remove_dead_ctrl_work(struct work_struct *work)
 	struct nvme_dev *dev = container_of(work, struct nvme_dev, remove_work);
 	struct pci_dev *pdev = to_pci_dev(dev->dev);

-	nvme_kill_queues(&dev->ctrl);
 	if (pci_get_drvdata(pdev))
 		device_release_driver(&pdev->dev);
 	nvme_put_ctrl(&dev->ctrl);