
nvme: remove disk after hw queue is started

Message ID 20170508161524.GE5696@ming.t460p (mailing list archive)
State New, archived

Commit Message

Ming Lei May 8, 2017, 4:15 p.m. UTC
On Mon, May 08, 2017 at 11:11:24AM -0400, Keith Busch wrote:
> On Mon, May 08, 2017 at 11:07:20AM -0400, Keith Busch wrote:
> > I'm almost certain the remove_work shouldn't even be running in this
> > case. If the reset work can't transition the controller state correctly,
> > it should assume something is handling the controller.
> 
> Here's the more complete version of what I had in mind. Does this solve
> the reported issue?
> 
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 26a5fd0..46a37fb 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1792,7 +1792,7 @@ static void nvme_reset_work(struct work_struct *work)
>  		nvme_dev_disable(dev, false);
>  
>  	if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING))
> -		goto out;
> +		return;
>  
>  	result = nvme_pci_enable(dev);
>  	if (result)
> @@ -1854,7 +1854,7 @@ static void nvme_reset_work(struct work_struct *work)
>  
>  	if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_LIVE)) {
>  		dev_warn(dev->ctrl.device, "failed to mark controller live\n");
> -		goto out;
> +		return;
>  	}
>  
>  	if (dev->online_queues > 1)

This patch looks like it works, but it seems any 'goto out' in this
function risks causing the same race too.

Another solution I thought of is to kill the queues earlier; what do you
think about the following patch?

---


Thanks,
Ming

Comments

Keith Busch May 8, 2017, 5:25 p.m. UTC | #1
On Tue, May 09, 2017 at 12:15:25AM +0800, Ming Lei wrote:
> This patch looks like it works, but it seems any 'goto out' in this
> function risks causing the same race too.

The goto was really intended for handling totally broken controllers,
which isn't the case if someone requested removal of the PCI device while
we're initializing it. Point taken, though; let me run a few tests and
see if there's a better way to handle this condition.

> Another solution I thought of is to kill the queues earlier; what do
> you think about the following patch?

That should get it unstuck, but it will error all the IO that fsync_bdev
would probably rather complete successfully.

Question though, why doesn't the remove_work's nvme_kill_queues in
its current place allow forward progress already?
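
For reference, nvme_kill_queues() in that era of the core driver looks
roughly like the sketch below (a simplified approximation, not the verbatim
source): it marks every namespace request queue as dying, so new submissions
fail with -EIO, and restarts the stopped hw queues so requests already queued
are dispatched and errored instead of hanging.

/* Simplified sketch of nvme_kill_queues(); an approximation, not verbatim. */
void nvme_kill_queues(struct nvme_ctrl *ctrl)
{
	struct nvme_ns *ns;

	mutex_lock(&ctrl->namespaces_mutex);
	list_for_each_entry(ns, &ctrl->namespaces, list) {
		/* New submissions on a dying queue fail immediately. */
		blk_set_queue_dying(ns->queue);

		/*
		 * Restart the stopped hw queues so already-queued requests
		 * get dispatched and errored rather than sitting forever.
		 */
		blk_mq_start_stopped_hw_queues(ns->queue, true);
	}
	mutex_unlock(&ctrl->namespaces_mutex);
}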
 
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index c8541c3dcd19..16740e8c4225 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1892,6 +1892,7 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
>  
>  	kref_get(&dev->ctrl.kref);
>  	nvme_dev_disable(dev, false);
> +	nvme_kill_queues(&dev->ctrl);
>  	if (!schedule_work(&dev->remove_work))
>  		nvme_put_ctrl(&dev->ctrl);
>  }
> @@ -1998,7 +1999,6 @@ static void nvme_remove_dead_ctrl_work(struct work_struct *work)
>  	struct nvme_dev *dev = container_of(work, struct nvme_dev, remove_work);
>  	struct pci_dev *pdev = to_pci_dev(dev->dev);
>  
> -	nvme_kill_queues(&dev->ctrl);
>  	if (pci_get_drvdata(pdev))
>  		device_release_driver(&pdev->dev);
>  	nvme_put_ctrl(&dev->ctrl);
Ming Lei May 9, 2017, 1:10 a.m. UTC | #2
Hi Keith,

Thanks for looking at this issue!

On Mon, May 08, 2017 at 01:25:12PM -0400, Keith Busch wrote:
> On Tue, May 09, 2017 at 12:15:25AM +0800, Ming Lei wrote:
> > This patch looks like it works, but it seems any 'goto out' in this
> > function risks causing the same race too.
> 
> The goto was really intended for handling totally broken controllers,
> which isn't the case if someone requested removal of the PCI device while
> we're initializing it. Point taken, though; let me run a few tests and
> see if there's a better way to handle this condition.

The thing is that remove can happen at any time, whether from hotplug,
from unbinding the driver, or from 'echo 1 > $PCI_PATH/remove'. At the
same time, a reset can be ongoing.

Also, it looks like the hang in del_gendisk() is fixed by this change, but I
just found a new issue which is triggered after the NVMe PCI device is
rescanned again after the last remove:

[  504.135554] VFS: Dirty inode writeback failed for block device nvme0n1p1 (err=-5).

> 
> > Another solution I thought of is to kill the queues earlier; what do
> > you think about the following patch?
> 
> That should get it unstuck, but it will error all the IO that fsync_bdev
> would probably rather complete successfully.

nvme_dev_disable(false) has already completed before the queues are killed in
nvme_remove_dead_ctrl(), so the hw queues are stopped and nvmeq->cq_vector is
set to -1 in nvme_suspend_queue(). That means no new I/O (including the I/O
from fsync_bdev) can be submitted successfully any more, so it looks
reasonable to kill the queues in nvme_remove_dead_ctrl().
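
For context, the relevant part of nvme_suspend_queue() looks roughly like
this (a simplified sketch, not verbatim); once cq_vector is -1, the
submission path (nvme_queue_rq) rejects any new command on that queue:

/* Simplified sketch of nvme_suspend_queue(); an approximation, not verbatim. */
static int nvme_suspend_queue(struct nvme_queue *nvmeq)
{
	spin_lock_irq(&nvmeq->q_lock);
	if (nvmeq->cq_vector == -1) {
		spin_unlock_irq(&nvmeq->q_lock);
		return 1;			/* already suspended */
	}
	nvmeq->dev->online_queues--;
	nvmeq->cq_vector = -1;			/* nvme_queue_rq() now fails new I/O */
	spin_unlock_irq(&nvmeq->q_lock);

	/* The irq is freed and the admin hw queue stopped after this point. */
	return 0;
}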

> 
> Question though, why doesn't the remove_work's nvme_kill_queues in
> its current place allow forward progress already?

That is because .remove_work may not have run before del_gendisk() is
started, even though .reset_work is flushed, and we can't simply flush
.remove_work here.
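
For context, the remove path looks roughly like the sketch below (a
simplified approximation of nvme_remove(), not verbatim): only .reset_work
is flushed before the namespaces are torn down, so .remove_work, and the
nvme_kill_queues() call inside it, may still be pending when del_gendisk()
runs.

/* Simplified sketch of nvme_remove(); an approximation, not verbatim. */
static void nvme_remove(struct pci_dev *pdev)
{
	struct nvme_dev *dev = pci_get_drvdata(pdev);

	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
	pci_set_drvdata(pdev, NULL);

	flush_work(&dev->reset_work);	/* only .reset_work is flushed ... */
	nvme_uninit_ctrl(&dev->ctrl);	/* ... then namespaces go away -> del_gendisk() */
	nvme_dev_disable(dev, true);
	nvme_dev_remove_admin(dev);
	nvme_free_queues(dev, 0);
	nvme_release_prp_pools(dev);
	nvme_dev_unmap(dev);
	nvme_put_ctrl(&dev->ctrl);
	/*
	 * .remove_work is never waited for here, so the nvme_kill_queues()
	 * in nvme_remove_dead_ctrl_work() may not have run yet.
	 */
}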

Thanks,
Ming

Patch

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index c8541c3dcd19..16740e8c4225 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1892,6 +1892,7 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
 
 	kref_get(&dev->ctrl.kref);
 	nvme_dev_disable(dev, false);
+	nvme_kill_queues(&dev->ctrl);
 	if (!schedule_work(&dev->remove_work))
 		nvme_put_ctrl(&dev->ctrl);
 }
@@ -1998,7 +1999,6 @@ static void nvme_remove_dead_ctrl_work(struct work_struct *work)
 	struct nvme_dev *dev = container_of(work, struct nvme_dev, remove_work);
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 
-	nvme_kill_queues(&dev->ctrl);
 	if (pci_get_drvdata(pdev))
 		device_release_driver(&pdev->dev);
 	nvme_put_ctrl(&dev->ctrl);