
[1/2] nvme: fix race between removing and reset failure

Message ID 20170517012729.13469-2-ming.lei@redhat.com (mailing list archive)
State New, archived

Commit Message

Ming Lei May 17, 2017, 1:27 a.m. UTC
When an NVMe PCI device is being reset and the reset fails,
nvme_remove_dead_ctrl() is called to handle the failure: the blk-mq hw queues
are stopped first, then .remove_work is scheduled to release the driver.

Unfortunately, if the driver is released via a sysfs store just before
.remove_work runs, del_gendisk() from nvme_remove() may hang forever,
because the hw queues are stopped and the writeback I/Os submitted by
fsync_bdev() can never complete.

This patch fixes the issue[1][2] by moving nvme_kill_queues() into
nvme_remove_dead_ctrl(). Since nvme_remove() flushes .reset_work, the
queues are guaranteed to be killed before del_gendisk() runs. This is
reasonable and safe because nvme_dev_disable() has already suspended the
queues and canceled the outstanding requests.
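
For reference, nvme_remove() in this tree does roughly the following
(abridged sketch of a v4.11-era drivers/nvme/host/pci.c, not a verbatim
copy), which is why flushing .reset_work guarantees the queues are killed
before del_gendisk() runs:

	static void nvme_remove(struct pci_dev *pdev)
	{
		struct nvme_dev *dev = pci_get_drvdata(pdev);

		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
		pci_set_drvdata(pdev, NULL);

		if (!pci_device_is_present(pdev))
			nvme_dev_disable(dev, false);

		/* waits for nvme_reset_work(), which calls
		 * nvme_remove_dead_ctrl() on reset failure */
		flush_work(&dev->reset_work);

		/* -> nvme_remove_namespaces() -> nvme_ns_remove() -> del_gendisk() */
		nvme_uninit_ctrl(&dev->ctrl);

		nvme_dev_disable(dev, true);
		/* ... queue/pool teardown and nvme_put_ctrl() ... */
	}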

[1] test script
	fio -filename=$NVME_DISK -iodepth=1 -thread -rw=randwrite -ioengine=psync \
	    -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 \
	    -bs_unaligned -runtime=1200 -size=-group_reporting -name=mytest -numjobs=60

	sleep 35
	echo 1 > $SYSFS_NVME_PCI_PATH/rescan
	echo 1 > $SYSFS_NVME_PCI_PATH/reset
	echo 1 > $SYSFS_NVME_PCI_PATH/remove
	echo 1 > /sys/bus/pci/rescan

[2] kernel hang log
[  492.232593] INFO: task nvme-test:5939 blocked for more than 120 seconds.
[  492.240081]       Not tainted 4.11.0.nvme_v4.11_debug_hang+ #3
[  492.246600] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  492.255346] nvme-test       D    0  5939   5938 0x00000080
[  492.261475] Call Trace:
[  492.264215]  __schedule+0x289/0x8f0
[  492.268105]  ? write_cache_pages+0x14c/0x510
[  492.272873]  schedule+0x36/0x80
[  492.276381]  io_schedule+0x16/0x40
[  492.280181]  wait_on_page_bit_common+0x137/0x220
[  492.285336]  ? page_cache_tree_insert+0x120/0x120
[  492.290589]  __filemap_fdatawait_range+0x128/0x1a0
[  492.295941]  filemap_fdatawait_range+0x14/0x30
[  492.300902]  filemap_fdatawait+0x23/0x30
[  492.305282]  filemap_write_and_wait+0x4c/0x80
[  492.310151]  __sync_blockdev+0x1f/0x40
[  492.314336]  fsync_bdev+0x44/0x50
[  492.318039]  invalidate_partition+0x24/0x50
[  492.322710]  del_gendisk+0xcd/0x2e0
[  492.326608]  nvme_ns_remove+0x105/0x130 [nvme_core]
[  492.332054]  nvme_remove_namespaces+0x32/0x50 [nvme_core]
[  492.338082]  nvme_uninit_ctrl+0x2d/0xa0 [nvme_core]
[  492.343519]  nvme_remove+0x5d/0x170 [nvme]
[  492.348096]  pci_device_remove+0x39/0xc0
[  492.352477]  device_release_driver_internal+0x141/0x1f0
[  492.358311]  device_release_driver+0x12/0x20
[  492.363072]  pci_stop_bus_device+0x8c/0xa0
[  492.367646]  pci_stop_and_remove_bus_device_locked+0x1a/0x30
[  492.373965]  remove_store+0x7c/0x90
[  492.377852]  dev_attr_store+0x18/0x30
[  492.381941]  sysfs_kf_write+0x3a/0x50
[  492.386028]  kernfs_fop_write+0xff/0x180
[  492.390409]  __vfs_write+0x37/0x160
[  492.394304]  ? selinux_file_permission+0xe5/0x120
[  492.399556]  ? security_file_permission+0x3b/0xc0
[  492.404807]  vfs_write+0xb2/0x1b0
[  492.408508]  ? syscall_trace_enter+0x1d0/0x2b0
[  492.413462]  SyS_write+0x55/0xc0
[  492.417064]  do_syscall_64+0x67/0x180
[  492.421155]  entry_SYSCALL64_slow_path+0x25/0x25

Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Comments

Johannes Thumshirn May 17, 2017, 6:38 a.m. UTC | #1
On 05/17/2017 03:27 AM, Ming Lei wrote:
> When one NVMe PCI device is being resetted and found reset failue,
> nvme_remove_dead_ctrl() is called to handle the failure: blk-mq hw queues
> are put into stopped first, then schedule .remove_work to release the driver.
> 
> Unfortunately if the driver is being released via sysfs store
> just before the .remove_work is run, del_gendisk() from
> nvme_remove() may hang forever because hw queues are stopped and
> the submitted writeback IOs from fsync_bdev() can't be completed at all.
> 
> This patch fixes the following issue[1][2] by moving nvme_kill_queues()
> into nvme_remove_dead_ctrl() to avoid the issue because nvme_remove()
> flushs .reset_work, and this way is reasonable and safe because
> nvme_dev_disable() has started to suspend queues and canceled requests
> already.
> 
> [1] test script
> 	fio -filename=$NVME_DISK -iodepth=1 -thread -rw=randwrite -ioengine=psync \
> 	    -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 \
> 	    -bs_unaligned -runtime=1200 -size=-group_reporting -name=mytest -numjobs=60

Nit: the actual size after the -size parameter is missing.

Anyways:
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Ming Lei May 17, 2017, 7:01 a.m. UTC | #2
On Wed, May 17, 2017 at 08:38:01AM +0200, Johannes Thumshirn wrote:
> On 05/17/2017 03:27 AM, Ming Lei wrote:
> > When one NVMe PCI device is being resetted and found reset failue,
> > nvme_remove_dead_ctrl() is called to handle the failure: blk-mq hw queues
> > are put into stopped first, then schedule .remove_work to release the driver.
> > 
> > Unfortunately if the driver is being released via sysfs store
> > just before the .remove_work is run, del_gendisk() from
> > nvme_remove() may hang forever because hw queues are stopped and
> > the submitted writeback IOs from fsync_bdev() can't be completed at all.
> > 
> > This patch fixes the following issue[1][2] by moving nvme_kill_queues()
> > into nvme_remove_dead_ctrl() to avoid the issue because nvme_remove()
> > flushs .reset_work, and this way is reasonable and safe because
> > nvme_dev_disable() has started to suspend queues and canceled requests
> > already.
> > 
> > [1] test script
> > 	fio -filename=$NVME_DISK -iodepth=1 -thread -rw=randwrite -ioengine=psync \
> > 	    -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 \
> > 	    -bs_unaligned -runtime=1200 -size=-group_reporting -name=mytest -numjobs=60
> 
> Nit: the actual size after the -size parameter is missing.

Forgot to mention: $NVME_DISK has to be a partition of the NVMe disk, and
the partition type needs to be a filesystem type for reproduction; in that
case the actual size isn't needed.

> 
> Anyways:
> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>

Thanks for review!


Thanks,
Ming
Christoph Hellwig May 18, 2017, 1:47 p.m. UTC | #3
On Wed, May 17, 2017 at 09:27:28AM +0800, Ming Lei wrote:
> When one NVMe PCI device is being resetted and found reset failue,
> nvme_remove_dead_ctrl() is called to handle the failure: blk-mq hw queues
> are put into stopped first, then schedule .remove_work to release the driver.
> 
> Unfortunately if the driver is being released via sysfs store
> just before the .remove_work is run, del_gendisk() from
> nvme_remove() may hang forever because hw queues are stopped and
> the submitted writeback IOs from fsync_bdev() can't be completed at all.
> 
> This patch fixes the following issue[1][2] by moving nvme_kill_queues()
> into nvme_remove_dead_ctrl() to avoid the issue because nvme_remove()
> flushs .reset_work, and this way is reasonable and safe because
> nvme_dev_disable() has started to suspend queues and canceled requests
> already.
> 
> [1] test script
> 	fio -filename=$NVME_DISK -iodepth=1 -thread -rw=randwrite -ioengine=psync \
> 	    -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 \
> 	    -bs_unaligned -runtime=1200 -size=-group_reporting -name=mytest -numjobs=60
> 
> 	sleep 35
> 	echo 1 > $SYSFS_NVME_PCI_PATH/rescan
> 	echo 1 > $SYSFS_NVME_PCI_PATH/reset
> 	echo 1 > $SYSFS_NVME_PCI_PATH/remove
> 	echo 1 > /sys/bus/pci/rescan
> 
> [2] kernel hang log
> [  492.232593] INFO: task nvme-test:5939 blocked for more than 120 seconds.
> [  492.240081]       Not tainted 4.11.0.nvme_v4.11_debug_hang+ #3
> [  492.246600] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  492.255346] nvme-test       D    0  5939   5938 0x00000080
> [  492.261475] Call Trace:
> [  492.264215]  __schedule+0x289/0x8f0
> [  492.268105]  ? write_cache_pages+0x14c/0x510
> [  492.272873]  schedule+0x36/0x80
> [  492.276381]  io_schedule+0x16/0x40
> [  492.280181]  wait_on_page_bit_common+0x137/0x220
> [  492.285336]  ? page_cache_tree_insert+0x120/0x120
> [  492.290589]  __filemap_fdatawait_range+0x128/0x1a0
> [  492.295941]  filemap_fdatawait_range+0x14/0x30
> [  492.300902]  filemap_fdatawait+0x23/0x30
> [  492.305282]  filemap_write_and_wait+0x4c/0x80
> [  492.310151]  __sync_blockdev+0x1f/0x40
> [  492.314336]  fsync_bdev+0x44/0x50
> [  492.318039]  invalidate_partition+0x24/0x50
> [  492.322710]  del_gendisk+0xcd/0x2e0
> [  492.326608]  nvme_ns_remove+0x105/0x130 [nvme_core]
> [  492.332054]  nvme_remove_namespaces+0x32/0x50 [nvme_core]
> [  492.338082]  nvme_uninit_ctrl+0x2d/0xa0 [nvme_core]
> [  492.343519]  nvme_remove+0x5d/0x170 [nvme]
> [  492.348096]  pci_device_remove+0x39/0xc0
> [  492.352477]  device_release_driver_internal+0x141/0x1f0
> [  492.358311]  device_release_driver+0x12/0x20
> [  492.363072]  pci_stop_bus_device+0x8c/0xa0
> [  492.367646]  pci_stop_and_remove_bus_device_locked+0x1a/0x30
> [  492.373965]  remove_store+0x7c/0x90
> [  492.377852]  dev_attr_store+0x18/0x30
> [  492.381941]  sysfs_kf_write+0x3a/0x50
> [  492.386028]  kernfs_fop_write+0xff/0x180
> [  492.390409]  __vfs_write+0x37/0x160
> [  492.394304]  ? selinux_file_permission+0xe5/0x120
> [  492.399556]  ? security_file_permission+0x3b/0xc0
> [  492.404807]  vfs_write+0xb2/0x1b0
> [  492.408508]  ? syscall_trace_enter+0x1d0/0x2b0
> [  492.413462]  SyS_write+0x55/0xc0
> [  492.417064]  do_syscall_64+0x67/0x180
> [  492.421155]  entry_SYSCALL64_slow_path+0x25/0x25
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/nvme/host/pci.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index fed803232edc..5e39abe57c56 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1887,6 +1887,14 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
>  
>  	kref_get(&dev->ctrl.kref);
>  	nvme_dev_disable(dev, false);

> +	/*
> +	 * nvme_dev_disable() has suspended queues, then no new I/O
> +	 * can be submitted to hardware successfully any more, so
> +	 * kill queues now for avoiding race between reset failure
> +	 * and remove.

How about this instead:

	/*
	 * nvme_dev_disable() has suspended the I/O queues and no new I/O can
	 * be submitted now.  Kill the queues now to avoid races between a
	 * possible reset failure and the controller removal work.
	 */

Otherwise this looks fine to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Keith Busch May 18, 2017, 2:13 p.m. UTC | #4
On Wed, May 17, 2017 at 09:27:28AM +0800, Ming Lei wrote:
> When one NVMe PCI device is being resetted and found reset failue,
> nvme_remove_dead_ctrl() is called to handle the failure: blk-mq hw queues
> are put into stopped first, then schedule .remove_work to release the driver.
> 
> Unfortunately if the driver is being released via sysfs store
> just before the .remove_work is run, del_gendisk() from
> nvme_remove() may hang forever because hw queues are stopped and
> the submitted writeback IOs from fsync_bdev() can't be completed at all.
> 
> This patch fixes the following issue[1][2] by moving nvme_kill_queues()
> into nvme_remove_dead_ctrl() to avoid the issue because nvme_remove()
> flushs .reset_work, and this way is reasonable and safe because
> nvme_dev_disable() has started to suspend queues and canceled requests
> already.

I'm still not sure moving where we kill the queues is the correct way
to fix this problem. The nvme_kill_queues restarts all the hardware
queues to force all IO to failure already, so why is this really stuck?
We should be able to make forward progress even if we kill the queues
while calling into del_gendisk, right? That could happen with a different
sequence of events, so that also needs to work.
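
For reference, nvme_kill_queues() at this point does roughly the following
(abridged sketch of the v4.11-era nvme core, not a verbatim copy):

	void nvme_kill_queues(struct nvme_ctrl *ctrl)
	{
		struct nvme_ns *ns;

		mutex_lock(&ctrl->namespaces_mutex);
		list_for_each_entry(ns, &ctrl->namespaces, list) {
			if (!ns->disk || test_and_set_bit(NVME_NS_DEAD, &ns->flags))
				continue;

			/* a dead namespace revalidates to capacity 0, which stops
			 * buffered writers from dirtying pages that can't be synced */
			revalidate_disk(ns->disk);

			blk_set_queue_dying(ns->queue);
			blk_mq_abort_requeue_list(ns->queue);

			/* restart stopped hw queues so pending I/O gets failed */
			blk_mq_start_stopped_hw_queues(ns->queue, true);
		}
		mutex_unlock(&ctrl->namespaces_mutex);
	}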
Ming Lei May 18, 2017, 3:04 p.m. UTC | #5
On Thu, May 18, 2017 at 03:47:34PM +0200, Christoph Hellwig wrote:
> On Wed, May 17, 2017 at 09:27:28AM +0800, Ming Lei wrote:
> > When one NVMe PCI device is being resetted and found reset failue,
> > nvme_remove_dead_ctrl() is called to handle the failure: blk-mq hw queues
> > are put into stopped first, then schedule .remove_work to release the driver.
> > 
> > Unfortunately if the driver is being released via sysfs store
> > just before the .remove_work is run, del_gendisk() from
> > nvme_remove() may hang forever because hw queues are stopped and
> > the submitted writeback IOs from fsync_bdev() can't be completed at all.
> > 
> > This patch fixes the following issue[1][2] by moving nvme_kill_queues()
> > into nvme_remove_dead_ctrl() to avoid the issue because nvme_remove()
> > flushs .reset_work, and this way is reasonable and safe because
> > nvme_dev_disable() has started to suspend queues and canceled requests
> > already.
> > 
> > [1] test script
> > 	fio -filename=$NVME_DISK -iodepth=1 -thread -rw=randwrite -ioengine=psync \
> > 	    -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 \
> > 	    -bs_unaligned -runtime=1200 -size=-group_reporting -name=mytest -numjobs=60
> > 
> > 	sleep 35
> > 	echo 1 > $SYSFS_NVME_PCI_PATH/rescan
> > 	echo 1 > $SYSFS_NVME_PCI_PATH/reset
> > 	echo 1 > $SYSFS_NVME_PCI_PATH/remove
> > 	echo 1 > /sys/bus/pci/rescan
> > 
> > [2] kernel hang log
> > [  492.232593] INFO: task nvme-test:5939 blocked for more than 120 seconds.
> > [  492.240081]       Not tainted 4.11.0.nvme_v4.11_debug_hang+ #3
> > [  492.246600] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [  492.255346] nvme-test       D    0  5939   5938 0x00000080
> > [  492.261475] Call Trace:
> > [  492.264215]  __schedule+0x289/0x8f0
> > [  492.268105]  ? write_cache_pages+0x14c/0x510
> > [  492.272873]  schedule+0x36/0x80
> > [  492.276381]  io_schedule+0x16/0x40
> > [  492.280181]  wait_on_page_bit_common+0x137/0x220
> > [  492.285336]  ? page_cache_tree_insert+0x120/0x120
> > [  492.290589]  __filemap_fdatawait_range+0x128/0x1a0
> > [  492.295941]  filemap_fdatawait_range+0x14/0x30
> > [  492.300902]  filemap_fdatawait+0x23/0x30
> > [  492.305282]  filemap_write_and_wait+0x4c/0x80
> > [  492.310151]  __sync_blockdev+0x1f/0x40
> > [  492.314336]  fsync_bdev+0x44/0x50
> > [  492.318039]  invalidate_partition+0x24/0x50
> > [  492.322710]  del_gendisk+0xcd/0x2e0
> > [  492.326608]  nvme_ns_remove+0x105/0x130 [nvme_core]
> > [  492.332054]  nvme_remove_namespaces+0x32/0x50 [nvme_core]
> > [  492.338082]  nvme_uninit_ctrl+0x2d/0xa0 [nvme_core]
> > [  492.343519]  nvme_remove+0x5d/0x170 [nvme]
> > [  492.348096]  pci_device_remove+0x39/0xc0
> > [  492.352477]  device_release_driver_internal+0x141/0x1f0
> > [  492.358311]  device_release_driver+0x12/0x20
> > [  492.363072]  pci_stop_bus_device+0x8c/0xa0
> > [  492.367646]  pci_stop_and_remove_bus_device_locked+0x1a/0x30
> > [  492.373965]  remove_store+0x7c/0x90
> > [  492.377852]  dev_attr_store+0x18/0x30
> > [  492.381941]  sysfs_kf_write+0x3a/0x50
> > [  492.386028]  kernfs_fop_write+0xff/0x180
> > [  492.390409]  __vfs_write+0x37/0x160
> > [  492.394304]  ? selinux_file_permission+0xe5/0x120
> > [  492.399556]  ? security_file_permission+0x3b/0xc0
> > [  492.404807]  vfs_write+0xb2/0x1b0
> > [  492.408508]  ? syscall_trace_enter+0x1d0/0x2b0
> > [  492.413462]  SyS_write+0x55/0xc0
> > [  492.417064]  do_syscall_64+0x67/0x180
> > [  492.421155]  entry_SYSCALL64_slow_path+0x25/0x25
> > 
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  drivers/nvme/host/pci.c | 9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index fed803232edc..5e39abe57c56 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -1887,6 +1887,14 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
> >  
> >  	kref_get(&dev->ctrl.kref);
> >  	nvme_dev_disable(dev, false);
> 
> > +	/*
> > +	 * nvme_dev_disable() has suspended queues, then no new I/O
> > +	 * can be submitted to hardware successfully any more, so
> > +	 * kill queues now for avoiding race between reset failure
> > +	 * and remove.
> 
> How about this instead:
> 
> 	/*
> 	 * nvme_dev_disable() has suspended the I/O queues and no new I/O can
> 	 * be submitted now.  Kill the queues now to avoid races between a
> 	 * possible reset failure and the controller removal work.
> 	 */

That is fine, will change to this.

Thanks,
Ming
Ming Lei May 19, 2017, 12:52 p.m. UTC | #6
On Thu, May 18, 2017 at 10:13:07AM -0400, Keith Busch wrote:
> On Wed, May 17, 2017 at 09:27:28AM +0800, Ming Lei wrote:
> > When one NVMe PCI device is being resetted and found reset failue,
> > nvme_remove_dead_ctrl() is called to handle the failure: blk-mq hw queues
> > are put into stopped first, then schedule .remove_work to release the driver.
> > 
> > Unfortunately if the driver is being released via sysfs store
> > just before the .remove_work is run, del_gendisk() from
> > nvme_remove() may hang forever because hw queues are stopped and
> > the submitted writeback IOs from fsync_bdev() can't be completed at all.
> > 
> > This patch fixes the following issue[1][2] by moving nvme_kill_queues()
> > into nvme_remove_dead_ctrl() to avoid the issue because nvme_remove()
> > flushs .reset_work, and this way is reasonable and safe because
> > nvme_dev_disable() has started to suspend queues and canceled requests
> > already.
> 
> I'm still not sure moving where we kill the queues is the correct way
> to fix this problem. The nvme_kill_queues restarts all the hardware
> queues to force all IO to failure already, so why is this really stuck?

That is a good question.

Today I investigated further and figured out that it is because
blk_mq_start_stopped_hw_queues() in nvme_kill_queues() does not actually
run the hw queues, since the queues have already been started in
nvme_reset_work(). I will work out a patch later to fix that issue.
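
Roughly, the logic there looks like this (simplified sketch of v4.11-era
block/blk-mq.c, not a verbatim copy):

	void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async)
	{
		struct blk_mq_hw_ctx *hctx;
		int i;

		queue_for_each_hw_ctx(q, hctx, i) {
			/* queues already restarted by nvme_reset_work() are not
			 * flagged BLK_MQ_S_STOPPED, so they are skipped here and
			 * never get the blk_mq_run_hw_queue() that would fail
			 * the pending I/O */
			if (!blk_mq_hctx_stopped(hctx))
				continue;

			clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
			blk_mq_run_hw_queue(hctx, async);
		}
	}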

And the reason this patch 'fixes' the issue is just timing, I guess.

> We should be able to make forward progress even if we kill the queues
> while calling into del_gendisk, right? That could happen with a different
> sequence of events, so that also needs to work.

Now I am not a big fan of this patch as the fix for the issue.

But I still think it may be better to move nvme_kill_queues() into
nvme_remove_dead_ctrl() as an improvement, because during this small
window the page cache can be used up by a writing application while no
writeback can make progress in the meantime.

Thanks,
Ming
Jens Axboe May 19, 2017, 2:41 p.m. UTC | #7
On 05/16/2017 07:27 PM, Ming Lei wrote:
> When one NVMe PCI device is being resetted and found reset failue,
> nvme_remove_dead_ctrl() is called to handle the failure: blk-mq hw queues
> are put into stopped first, then schedule .remove_work to release the driver.
> 
> Unfortunately if the driver is being released via sysfs store
> just before the .remove_work is run, del_gendisk() from
> nvme_remove() may hang forever because hw queues are stopped and
> the submitted writeback IOs from fsync_bdev() can't be completed at all.
> 
> This patch fixes the following issue[1][2] by moving nvme_kill_queues()
> into nvme_remove_dead_ctrl() to avoid the issue because nvme_remove()
> flushs .reset_work, and this way is reasonable and safe because
> nvme_dev_disable() has started to suspend queues and canceled requests
> already.
> 
> [1] test script
> 	fio -filename=$NVME_DISK -iodepth=1 -thread -rw=randwrite -ioengine=psync \
> 	    -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 \
> 	    -bs_unaligned -runtime=1200 -size=-group_reporting -name=mytest -numjobs=60
> 
> 	sleep 35
> 	echo 1 > $SYSFS_NVME_PCI_PATH/rescan
> 	echo 1 > $SYSFS_NVME_PCI_PATH/reset
> 	echo 1 > $SYSFS_NVME_PCI_PATH/remove
> 	echo 1 > /sys/bus/pci/rescan

The patch looks good to me. But since you have a nice reproducer, how about
turning that into a blktests [1] test case?

[1] https://github.com/osandov/blktests
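
Something along these lines would probably be a reasonable starting point.
This is only a rough sketch of the blktests layout; the hooks and helpers
used here (requires(), test_device(), _have_fio, TEST_NAME, TEST_DEV) are
from memory and the PCI sysfs path handling is left open, so treat it as an
assumption-laden skeleton rather than a finished test:

	#!/bin/bash
	#
	# Sketch: race an NVMe reset failure against controller removal.
	# Helper names and hooks are assumptions, not copied from the repo.

	DESCRIPTION="race NVMe reset failure against controller removal"
	TIMED=1

	requires() {
		_have_fio
	}

	test_device() {
		echo "Running ${TEST_NAME}"

		# TEST_DEV should be a filesystem-type partition on the NVMe
		# disk, per Ming's note earlier in the thread; then no -size
		# parameter is needed.
		fio -filename="$TEST_DEV" -iodepth=1 -thread -rw=randwrite \
		    -ioengine=psync -bs_unaligned -runtime=1200 \
		    -group_reporting -name=mytest -numjobs=60 &> /dev/null &
		fio_pid=$!

		sleep 35

		# SYSFS_NVME_PCI_PATH: the controller's PCI device directory,
		# to be derived from TEST_DEV (left open here, as in the
		# original script).
		echo 1 > "$SYSFS_NVME_PCI_PATH/rescan"
		echo 1 > "$SYSFS_NVME_PCI_PATH/reset"
		echo 1 > "$SYSFS_NVME_PCI_PATH/remove"
		echo 1 > /sys/bus/pci/rescan

		kill "$fio_pid" 2> /dev/null
		wait

		echo "Test complete"
	}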
Ming Lei May 19, 2017, 3:10 p.m. UTC | #8
On Fri, May 19, 2017 at 08:41:13AM -0600, Jens Axboe wrote:
> On 05/16/2017 07:27 PM, Ming Lei wrote:
> > When one NVMe PCI device is being resetted and found reset failue,
> > nvme_remove_dead_ctrl() is called to handle the failure: blk-mq hw queues
> > are put into stopped first, then schedule .remove_work to release the driver.
> > 
> > Unfortunately if the driver is being released via sysfs store
> > just before the .remove_work is run, del_gendisk() from
> > nvme_remove() may hang forever because hw queues are stopped and
> > the submitted writeback IOs from fsync_bdev() can't be completed at all.
> > 
> > This patch fixes the following issue[1][2] by moving nvme_kill_queues()
> > into nvme_remove_dead_ctrl() to avoid the issue because nvme_remove()
> > flushs .reset_work, and this way is reasonable and safe because
> > nvme_dev_disable() has started to suspend queues and canceled requests
> > already.
> > 
> > [1] test script
> > 	fio -filename=$NVME_DISK -iodepth=1 -thread -rw=randwrite -ioengine=psync \
> > 	    -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 \
> > 	    -bs_unaligned -runtime=1200 -size=-group_reporting -name=mytest -numjobs=60
> > 
> > 	sleep 35
> > 	echo 1 > $SYSFS_NVME_PCI_PATH/rescan
> > 	echo 1 > $SYSFS_NVME_PCI_PATH/reset
> > 	echo 1 > $SYSFS_NVME_PCI_PATH/remove
> > 	echo 1 > /sys/bus/pci/rescan
> 
> The patch looks good to me. But since you have a nice reproducer, how about
> turning that into a blktests [1] test case?

This test has triggered several problems recently, so it looks like a good
idea to integrate it into blktests.

I am a little busy at the moment, so it may take a while before I can start
on that. If anyone would like to integrate the case, please go ahead; I am
happy to provide any details you need.


Thanks,
Ming
Keith Busch May 19, 2017, 3:15 p.m. UTC | #9
On Fri, May 19, 2017 at 08:52:45PM +0800, Ming Lei wrote:
> But I still think it may be better to move nvme_kill_queues() into
> nvme_remove_dead_ctrl() as an improvement because during this small
> window page cache can be used up by write application, and no writeback
> can move on meantime.

Yes, I agree that's a better placement for it. I was just concerned
about the reasoning since it would also mean we're still stuck if an
IO timeout occurs while calling del_gendisk. So I'm okay with the patch
as-is, but I'll also look into my other concern.
Ming Lei May 19, 2017, 4:40 p.m. UTC | #10
On Fri, May 19, 2017 at 08:41:13AM -0600, Jens Axboe wrote:
> On 05/16/2017 07:27 PM, Ming Lei wrote:
> > When one NVMe PCI device is being resetted and found reset failue,
> > nvme_remove_dead_ctrl() is called to handle the failure: blk-mq hw queues
> > are put into stopped first, then schedule .remove_work to release the driver.
> > 
> > Unfortunately if the driver is being released via sysfs store
> > just before the .remove_work is run, del_gendisk() from
> > nvme_remove() may hang forever because hw queues are stopped and
> > the submitted writeback IOs from fsync_bdev() can't be completed at all.
> > 
> > This patch fixes the following issue[1][2] by moving nvme_kill_queues()
> > into nvme_remove_dead_ctrl() to avoid the issue because nvme_remove()
> > flushs .reset_work, and this way is reasonable and safe because
> > nvme_dev_disable() has started to suspend queues and canceled requests
> > already.
> > 
> > [1] test script
> > 	fio -filename=$NVME_DISK -iodepth=1 -thread -rw=randwrite -ioengine=psync \
> > 	    -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 \
> > 	    -bs_unaligned -runtime=1200 -size=-group_reporting -name=mytest -numjobs=60
> > 
> > 	sleep 35
> > 	echo 1 > $SYSFS_NVME_PCI_PATH/rescan
> > 	echo 1 > $SYSFS_NVME_PCI_PATH/reset
> > 	echo 1 > $SYSFS_NVME_PCI_PATH/remove
> > 	echo 1 > /sys/bus/pci/rescan
> 
> The patch looks good to me. But since you have a nice reproducer, how about
> turning that into a blktests [1] test case?
> 
> [1] https://github.com/osandov/blktests

Forgot to mention: this test case was written by Zhang Yi.

Zhang Yi, maybe you can try to integrate your NVMe test case into
blktests if you are interested, :-)

Thanks,
Ming
Yi Zhang May 19, 2017, 4:55 p.m. UTC | #11
On 05/20/2017 12:40 AM, Ming Lei wrote:
> On Fri, May 19, 2017 at 08:41:13AM -0600, Jens Axboe wrote:
>> On 05/16/2017 07:27 PM, Ming Lei wrote:
>>> When one NVMe PCI device is being resetted and found reset failue,
>>> nvme_remove_dead_ctrl() is called to handle the failure: blk-mq hw queues
>>> are put into stopped first, then schedule .remove_work to release the driver.
>>>
>>> Unfortunately if the driver is being released via sysfs store
>>> just before the .remove_work is run, del_gendisk() from
>>> nvme_remove() may hang forever because hw queues are stopped and
>>> the submitted writeback IOs from fsync_bdev() can't be completed at all.
>>>
>>> This patch fixes the following issue[1][2] by moving nvme_kill_queues()
>>> into nvme_remove_dead_ctrl() to avoid the issue because nvme_remove()
>>> flushs .reset_work, and this way is reasonable and safe because
>>> nvme_dev_disable() has started to suspend queues and canceled requests
>>> already.
>>>
>>> [1] test script
>>> 	fio -filename=$NVME_DISK -iodepth=1 -thread -rw=randwrite -ioengine=psync \
>>> 	    -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 \
>>> 	    -bs_unaligned -runtime=1200 -size=-group_reporting -name=mytest -numjobs=60
>>>
>>> 	sleep 35
>>> 	echo 1 > $SYSFS_NVME_PCI_PATH/rescan
>>> 	echo 1 > $SYSFS_NVME_PCI_PATH/reset
>>> 	echo 1 > $SYSFS_NVME_PCI_PATH/remove
>>> 	echo 1 > /sys/bus/pci/rescan
>> The patch looks good to me. But since you have a nice reproducer, how about
>> turning that into a blktests [1] test case?
>>
>> [1] https://github.com/osandov/blktests
> Forget to mention, this test case is written by Zhang Yi.
>
> Zhang Yi, maybe you can try to integrate your NVMe test case into
> blktests if you are interested, :-)
Sure, I will have a try. :)

Thanks
Yi
> Thanks,
> Ming
>
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme

Patch

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fed803232edc..5e39abe57c56 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1887,6 +1887,14 @@  static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
 
 	kref_get(&dev->ctrl.kref);
 	nvme_dev_disable(dev, false);
+
+	/*
+	 * nvme_dev_disable() has suspended queues, then no new I/O
+	 * can be submitted to hardware successfully any more, so
+	 * kill queues now for avoiding race between reset failure
+	 * and remove.
+	 */
+	nvme_kill_queues(&dev->ctrl);
 	if (!schedule_work(&dev->remove_work))
 		nvme_put_ctrl(&dev->ctrl);
 }
@@ -1993,7 +2001,6 @@  static void nvme_remove_dead_ctrl_work(struct work_struct *work)
 	struct nvme_dev *dev = container_of(work, struct nvme_dev, remove_work);
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 
-	nvme_kill_queues(&dev->ctrl);
 	if (pci_get_drvdata(pdev))
 		device_release_driver(&pdev->dev);
 	nvme_put_ctrl(&dev->ctrl);