From patchwork Tue May  9 03:26:59 2017
X-Patchwork-Submitter: Ming Lei
X-Patchwork-Id: 9716997
Date: Tue, 9 May 2017 11:26:59 +0800
From: Ming Lei <ming.lei@redhat.com>
To: Keith Busch
Cc: Jens Axboe, Sagi Grimberg, stable@vger.kernel.org,
 linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
 Christoph Hellwig
Subject: Re: [PATCH] nvme: remove disk after hw queue is started
Message-ID: <20170509032654.GA15037@ming.t460p>
References: <20170508112457.10236-1-ming.lei@redhat.com>
 <20170508124638.GD5696@ming.t460p>
 <20170508150720.GB32736@localhost.localdomain>
 <20170508151123.GA463@localhost.localdomain>
 <20170508161524.GE5696@ming.t460p>
 <20170508172511.GA1750@localhost.localdomain>
 <20170509011025.GA13210@ming.t460p>
In-Reply-To: <20170509011025.GA13210@ming.t460p>
X-Mailing-List: linux-block@vger.kernel.org

Hi Keith,

On Tue, May 09, 2017 at 09:10:30AM +0800, Ming Lei wrote:
> Hi Keith,
>
> Thanks for looking at this issue!
>
> On Mon, May 08, 2017 at 01:25:12PM -0400, Keith Busch wrote:
> > On Tue, May 09, 2017 at 12:15:25AM +0800, Ming Lei wrote:
> > > This patch looks to be working, but it seems any 'goto out' in this
> > > function may risk causing the same race too.
> >
> > The goto was really intended for handling totally broken controllers,
> > which isn't the case if someone requested to remove the pci device while
> > we're initializing it. Point taken, though, let me run a few tests and
> > see if there's a better way to handle this condition.
>
> The thing is that remove can happen at any time, either from hotplug,
> unbinding the driver, or 'echo 1 > $PCI_PATH/remove'. At the same time,
> a reset can be ongoing.
>
> Also, it looks like the hang in del_gendisk() is fixed by this change,
> but I just found a new issue which is triggered after the NVMe PCI
> device is rescanned again after the last remove:
>
> [  504.135554] VFS: Dirty inode writeback failed for block device nvme0n1p1 (err=-5).
>
> > > Another solution I thought of is to kill queues earlier, what do you
> > > think about the following patch?
> >
> > That should get it unstuck, but it will error all the IO that fsync_bdev
> > would probably rather complete successfully.
>
> nvme_dev_disable(false) has already completed before the queues are
> killed in nvme_remove_dead_ctrl(), so the hw queues are stopped and
> nvmeq->cq_vector is set to -1 in nvme_suspend_queue(). That means no
> new I/O (including the I/O issued by fsync_bdev()) can be submitted
> successfully any more, so it looks reasonable to kill the queues in
> nvme_remove_dead_ctrl().
>
> > Question though, why doesn't the remove_work's nvme_kill_queues in
> > its current place allow forward progress already?
>
> That is because .remove_work may not have run before del_gendisk() is
> started, even though .reset_work is flushed, and we can't simply flush
> .remove_work here.
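For anyone following along: what makes killing the queues earlier
sufficient is what nvme_kill_queues() does to each namespace queue --
it marks the queue as dying, fails whatever is parked on the requeue
list, and restarts the stopped hw queues so that queued I/O is errored
out instead of waiting forever. Roughly like this (a from-memory sketch
of the idea only, not copied from the tree):

	/* Sketch only -- the real nvme_kill_queues() is in drivers/nvme/host/core.c */
	void nvme_kill_queues(struct nvme_ctrl *ctrl)
	{
		struct nvme_ns *ns;

		mutex_lock(&ctrl->namespaces_mutex);
		list_for_each_entry(ns, &ctrl->namespaces, list) {
			if (!ns->disk)
				continue;
			/* new submissions fail immediately instead of being queued */
			blk_set_queue_dying(ns->queue);
			/* nothing on the requeue list can ever reach dead hardware */
			blk_mq_abort_requeue_list(ns->queue);
			/* let already-queued requests hit the dying queue and fail */
			blk_mq_start_stopped_hw_queues(ns->queue, true);
		}
		mutex_unlock(&ctrl->namespaces_mutex);
	}

With the queues already in that state before .remove_work ever runs,
the fsync_bdev() done on the disk removal path can't get stuck waiting
for I/O that will never complete.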
The following patch should be a complete version. The I/O stress test
with PCI reset/remove has survived for hours with it, whereas previously
the hang could be reproduced within half an hour at most.

---

Thanks,
Ming

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index d5e0906262ea..f74cdf8e710f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2097,6 +2097,15 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 					&nvme_ns_attr_group);
 		if (ns->ndev)
 			nvme_nvm_unregister_sysfs(ns);
+
+		/*
+		 * If the queue is dead, abort the requests in the
+		 * requeue list first, because fsync_bdev() in the disk
+		 * removal path may wait for these IOs, which can't
+		 * be submitted to the hardware either.
+		 */
+		if (blk_queue_dying(ns->queue))
+			blk_mq_abort_requeue_list(ns->queue);
 		del_gendisk(ns->disk);
 		blk_mq_abort_requeue_list(ns->queue);
 		blk_cleanup_queue(ns->queue);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index c8541c3dcd19..661b5ff7c168 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1892,6 +1892,14 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
 
 	kref_get(&dev->ctrl.kref);
 	nvme_dev_disable(dev, false);
+
+	/*
+	 * nvme_dev_disable() has suspended the queues, so no new I/O
+	 * can be submitted to the hardware successfully any more;
+	 * kill the queues now to avoid a race between reset failure
+	 * and remove.
+	 */
+	nvme_kill_queues(&dev->ctrl);
 	if (!schedule_work(&dev->remove_work))
 		nvme_put_ctrl(&dev->ctrl);
 }
@@ -1998,7 +2006,6 @@ static void nvme_remove_dead_ctrl_work(struct work_struct *work)
 	struct nvme_dev *dev = container_of(work, struct nvme_dev, remove_work);
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 
-	nvme_kill_queues(&dev->ctrl);
 	if (pci_get_drvdata(pdev))
 		device_release_driver(&pdev->dev);
 	nvme_put_ctrl(&dev->ctrl);
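
As a footnote on the nvme_ns_remove() hunk: the requests that
fsync_bdev() ends up waiting for are the ones parked on the requeue
list, and blk_mq_abort_requeue_list() simply ends every one of them
with an error. Roughly (again a sketch from memory, not the exact
block layer code):

	/* Sketch of the idea behind blk_mq_abort_requeue_list() (see block/blk-mq.c) */
	void blk_mq_abort_requeue_list(struct request_queue *q)
	{
		unsigned long flags;
		LIST_HEAD(rq_list);

		/* steal everything currently parked on the requeue list */
		spin_lock_irqsave(&q->requeue_lock, flags);
		list_splice_init(&q->requeue_list, &rq_list);
		spin_unlock_irqrestore(&q->requeue_lock, flags);

		/* complete each stolen request with an I/O error */
		while (!list_empty(&rq_list)) {
			struct request *rq = list_first_entry(&rq_list,
							struct request, queuelist);

			list_del_init(&rq->queuelist);
			blk_mq_end_request(rq, -EIO);
		}
	}

Doing that before del_gendisk() means the fsync_bdev() called on the
removal path gets an error back for those writes instead of waiting
forever on a dead controller.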