From patchwork Thu May 25 04:21:30 2017
X-Patchwork-Submitter: Ming Lei
X-Patchwork-Id: 9747661
From: Ming Lei <ming.lei@redhat.com>
To: Jens Axboe, linux-block@vger.kernel.org, Christoph Hellwig
Cc: Bart Van Assche
Subject: [PATCH 6/7] blk-mq: quiesce queue via percpu_ref
Date: Thu, 25 May 2017 12:21:30 +0800
Message-Id: <20170525042131.13172-7-ming.lei@redhat.com>
In-Reply-To: <20170525042131.13172-1-ming.lei@redhat.com>
References: <20170525042131.13172-1-ming.lei@redhat.com>
X-Mailing-List: linux-block@vger.kernel.org

One big problem of blk_mq_quiesce_queue() is that it cannot prevent
.queue_rq() in the direct issue path from running even though the hw
queues are stopped. It has been observed that a request double-free or
use-after-free can easily be triggered when canceling NVMe requests via
blk_mq_tagset_busy_iter(...nvme_cancel_request) in nvme_dev_disable(),
because blk_mq_quiesce_queue() does not stop dispatch from running.
Other drivers (mtip32xx, nbd) likewise need to quiesce the queue before
canceling dispatched requests via blk_mq_tagset_busy_iter(); otherwise a
use-after-free can occur.
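For illustration, the driver-side pattern this is meant to make safe
looks roughly as follows. The demo_* names and the demo_ctrl structure
are placeholders rather than real nvme code, and a real cancel callback
would complete or fail the request in a driver-specific way:

#include <linux/blk-mq.h>
#include <linux/blkdev.h>

/* Hypothetical driver state, for illustration only. */
struct demo_ctrl {
	struct request_queue	*queue;
	struct blk_mq_tag_set	tagset;
};

/* Called for every dispatched (busy) request while iterating the tag set. */
static void demo_cancel_request(struct request *rq, void *data, bool reserved)
{
	/*
	 * Driver-specific cancellation goes here; e.g. nvme_cancel_request()
	 * marks the request aborted and completes it.
	 */
}

static void demo_disable(struct demo_ctrl *ctrl)
{
	/*
	 * Stop dispatch first.  With this patch, once blk_mq_quiesce_queue()
	 * returns, no .queue_rq() is running and none can start, including
	 * via the direct issue path.
	 */
	blk_mq_quiesce_queue(ctrl->queue);

	/* Now it is safe to walk and cancel the dispatched requests. */
	blk_mq_tagset_busy_iter(&ctrl->tagset, demo_cancel_request, ctrl);

	/* Re-allow dispatch once teardown/reset is complete. */
	blk_mq_unquiesce_queue(ctrl->queue);
}

Without the quiesce step, .queue_rq() can still run from the direct
issue path and touch a request that the iterator is canceling, which is
the double-free/use-after-free described above.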
This patch implements queue quiescing via percpu_ref and fixes the above
issue. It also brings the following benefits:

- rcu and srcu are currently used for the non-blocking and blocking
  cases respectively; the code becomes much cleaner once both are
  unified via percpu_ref (a condensed sketch of the resulting dispatch
  lifecycle follows the patch)

- the fat 'srcu_struct' can be removed from 'blk_mq_hw_ctx'

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c         | 112 +++++++++++++++++++++++++++++++------------------
 include/linux/blk-mq.h |   2 -
 include/linux/blkdev.h |   5 +++
 3 files changed, 77 insertions(+), 42 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a26fee3fb389..6316efb42e04 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -164,20 +164,23 @@ EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
  */
 void blk_mq_quiesce_queue(struct request_queue *q)
 {
-	struct blk_mq_hw_ctx *hctx;
-	unsigned int i;
-	bool rcu = false;
+	bool old;
 
 	__blk_mq_stop_hw_queues(q, true);
 
-	queue_for_each_hw_ctx(q, hctx, i) {
-		if (hctx->flags & BLK_MQ_F_BLOCKING)
-			synchronize_srcu(&hctx->queue_rq_srcu);
-		else
-			rcu = true;
-	}
-	if (rcu)
-		synchronize_rcu();
+	mutex_lock(&q->quiesce_mutex);
+
+	spin_lock_irq(q->queue_lock);
+	old = queue_flag_test_and_set(QUEUE_FLAG_QUIESCED, q);
+	spin_unlock_irq(q->queue_lock);
+
+	if (old)
+		goto exit;
+
+	percpu_ref_kill(&q->q_dispatch_counter);
+	wait_event(q->quiesce_wq, percpu_ref_is_zero(&q->q_dispatch_counter));
+ exit:
+	mutex_unlock(&q->quiesce_mutex);
 }
 EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);
 
@@ -190,6 +193,21 @@ EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);
  */
 void blk_mq_unquiesce_queue(struct request_queue *q)
 {
+	bool old;
+
+	mutex_lock(&q->quiesce_mutex);
+
+	spin_lock_irq(q->queue_lock);
+	old = queue_flag_test_and_clear(QUEUE_FLAG_QUIESCED, q);
+	spin_unlock_irq(q->queue_lock);
+
+	if (!old)
+		goto exit;
+
+	percpu_ref_reinit(&q->q_dispatch_counter);
+ exit:
+	mutex_unlock(&q->quiesce_mutex);
+
 	blk_mq_start_stopped_hw_queues(q, true);
 }
 EXPORT_SYMBOL_GPL(blk_mq_unquiesce_queue);
 
@@ -209,6 +227,9 @@ void blk_mq_wake_waiters(struct request_queue *q)
 	 * the queue are notified as well.
 	 */
 	wake_up_all(&q->mq_freeze_wq);
+
+	/* Forcibly unquiesce the queue to avoid having stuck requests */
+	blk_mq_unquiesce_queue(q);
 }
 
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx)
@@ -1099,24 +1120,28 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 	return (queued + errors) != 0;
 }
 
-static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
+static inline bool blk_mq_dispatch_start(struct request_queue *q)
+{
+	return percpu_ref_tryget_live(&q->q_dispatch_counter);
+}
+
+static inline void blk_mq_dispatch_end(struct request_queue *q)
 {
-	int srcu_idx;
+	percpu_ref_put(&q->q_dispatch_counter);
+}
 
+static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
+{
 	WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
 		cpu_online(hctx->next_cpu));
 
-	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
-		rcu_read_lock();
-		blk_mq_sched_dispatch_requests(hctx);
-		rcu_read_unlock();
-	} else {
-		might_sleep();
+	if (!blk_mq_dispatch_start(hctx->queue))
+		return;
 
-		srcu_idx = srcu_read_lock(&hctx->queue_rq_srcu);
-		blk_mq_sched_dispatch_requests(hctx);
-		srcu_read_unlock(&hctx->queue_rq_srcu, srcu_idx);
-	}
+	might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
+	blk_mq_sched_dispatch_requests(hctx);
+
+	blk_mq_dispatch_end(hctx->queue);
 }
 
 /*
@@ -1519,18 +1544,14 @@ static void __blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie,
 static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, blk_qc_t *cookie)
 {
-	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
-		rcu_read_lock();
-		__blk_mq_try_issue_directly(rq, cookie, false);
-		rcu_read_unlock();
-	} else {
-		unsigned int srcu_idx;
-
-		might_sleep();
+	bool blocking = hctx->flags & BLK_MQ_F_BLOCKING;
 
-		srcu_idx = srcu_read_lock(&hctx->queue_rq_srcu);
-		__blk_mq_try_issue_directly(rq, cookie, true);
-		srcu_read_unlock(&hctx->queue_rq_srcu, srcu_idx);
+	if (blk_mq_dispatch_start(hctx->queue)) {
+		might_sleep_if(blocking);
+		__blk_mq_try_issue_directly(rq, cookie, blocking);
+		blk_mq_dispatch_end(hctx->queue);
+	} else {
+		blk_mq_sched_insert_request(rq, false, false, false, blocking);
 	}
 }
 
@@ -1869,9 +1890,6 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 	if (set->ops->exit_hctx)
 		set->ops->exit_hctx(hctx, hctx_idx);
 
-	if (hctx->flags & BLK_MQ_F_BLOCKING)
-		cleanup_srcu_struct(&hctx->queue_rq_srcu);
-
 	blk_mq_remove_cpuhp(hctx);
 	blk_free_flush_queue(hctx->fq);
 	sbitmap_free(&hctx->ctx_map);
@@ -1942,9 +1960,6 @@ static int blk_mq_init_hctx(struct request_queue *q,
 				   node))
 		goto free_fq;
 
-	if (hctx->flags & BLK_MQ_F_BLOCKING)
-		init_srcu_struct(&hctx->queue_rq_srcu);
-
 	blk_mq_debugfs_register_hctx(q, hctx);
 
 	return 0;
@@ -2272,12 +2287,27 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 		blk_mq_sysfs_register(q);
 }
 
+static void blk_queue_dispatch_counter_release(struct percpu_ref *ref)
+{
+	struct request_queue *q =
+		container_of(ref, struct request_queue, q_dispatch_counter);
+
+	wake_up_all(&q->quiesce_wq);
+}
+
 struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 						  struct request_queue *q)
 {
 	/* mark the queue as mq asap */
 	q->mq_ops = set->ops;
 
+	init_waitqueue_head(&q->quiesce_wq);
+	mutex_init(&q->quiesce_mutex);
+	if (percpu_ref_init(&q->q_dispatch_counter,
+			    blk_queue_dispatch_counter_release,
+			    0, GFP_KERNEL))
+		goto err_ref;
+
 	q->poll_cb = blk_stat_alloc_callback(blk_mq_poll_stats_fn,
 					     blk_mq_poll_stats_bkt,
 					     BLK_MQ_POLL_STATS_BKTS, q);
@@ -2360,6 +2390,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 err_percpu:
 	free_percpu(q->queue_ctx);
 err_exit:
+	percpu_ref_exit(&q->q_dispatch_counter);
+err_ref:
 	q->mq_ops = NULL;
 	return ERR_PTR(-ENOMEM);
 }
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index fcd641032f8d..6da015e6d38a 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -39,8 +39,6 @@ struct blk_mq_hw_ctx {
 	struct blk_mq_tags	*tags;
 	struct blk_mq_tags	*sched_tags;
 
-	struct srcu_struct	queue_rq_srcu;
-
 	unsigned long		queued;
 	unsigned long		run;
 #define BLK_MQ_MAX_DISPATCH_ORDER  7
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 60967797f4f6..a85841e9113d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -586,6 +586,11 @@ struct request_queue {
 
 	size_t			cmd_size;
 	void			*rq_alloc_data;
+
+	/* for blk_mq_quiesce_queue() */
+	wait_queue_head_t	quiesce_wq;
+	struct percpu_ref	q_dispatch_counter;
+	struct mutex		quiesce_mutex;
 };
 
 #define QUEUE_FLAG_QUEUED	1	/* uses generic tag queueing */
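
For readers skimming the diff, here is a condensed, out-of-tree sketch
of the dispatch/quiesce lifecycle the patch wires up. The demo_*
wrappers are illustrative only; the field names (q_dispatch_counter,
quiesce_wq) match the patch, and locking/error handling is trimmed:

#include <linux/blkdev.h>
#include <linux/percpu-refcount.h>
#include <linux/wait.h>

/* Dispatch side: every dispatch run holds a live reference. */
static bool demo_dispatch_start(struct request_queue *q)
{
	/* Fails once percpu_ref_kill() has been called, i.e. when quiesced. */
	return percpu_ref_tryget_live(&q->q_dispatch_counter);
}

static void demo_dispatch_end(struct request_queue *q)
{
	percpu_ref_put(&q->q_dispatch_counter);
}

/* Release callback: last reference dropped, wake up the quiescer. */
static void demo_dispatch_counter_release(struct percpu_ref *ref)
{
	struct request_queue *q =
		container_of(ref, struct request_queue, q_dispatch_counter);

	wake_up_all(&q->quiesce_wq);
}

/* Quiesce: stop new dispatches and wait for in-flight ones to drain. */
static void demo_quiesce(struct request_queue *q)
{
	percpu_ref_kill(&q->q_dispatch_counter);
	wait_event(q->quiesce_wq, percpu_ref_is_zero(&q->q_dispatch_counter));
}

/* Unquiesce: re-arm the reference so dispatch can proceed again. */
static void demo_unquiesce(struct request_queue *q)
{
	percpu_ref_reinit(&q->q_dispatch_counter);
}

In the common (non-quiesced) case percpu_ref keeps the dispatch-side
cost at a per-cpu increment and decrement, which is why a single
mechanism can replace both the rcu and the srcu paths here.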