From patchwork Tue Feb 21 21:29:56 2017
X-Patchwork-Submitter: Omar Sandoval
X-Patchwork-Id: 9585755
From: Omar Sandoval
To: Jens Axboe , linux-block@vger.kernel.org
Cc: kernel-team@fb.com
Subject: [PATCH v2 1/2] blk-mq: use sbq wait queues instead of restart for driver tags
Date: Tue, 21 Feb 2017 13:29:56 -0800
Message-Id: <19bf336f1d329df8d12fbdc5ab8842f81880c71d.1487712413.git.osandov@fb.com>
X-Mailer: git-send-email 2.11.1
X-Mailing-List: linux-block@vger.kernel.org

From: Omar Sandoval

Commit 50e1dab86aa2 ("blk-mq-sched: fix starvation for multiple hardware
queues and shared tags") fixed one starvation issue for shared tags.
However, we can still get into a situation where we fail to allocate a
tag because all tags are allocated but we don't have any pending
requests on any hardware queue.
One solution for this would be to restart all queues that share a tag
map, but that really sucks. Ideally, we could just block and wait for a
tag, but that isn't always possible from blk_mq_dispatch_rq_list().

However, we can still use the struct sbitmap_queue wait queues with a
custom callback instead of blocking. This has a few benefits:

1. It avoids iterating over all hardware queues when completing an I/O,
   which the current restart code has to do.
2. It benefits from the existing rolling wakeup code.
3. It avoids punting to another thread just to have it block.

Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe
---
Changed from v1:
- Call blk_mq_get_driver_tag() with &hctx, not NULL, to make sure we
  handle the case of a request from a different hardware queue than the
  one we're running (e.g., with nvme+mq-deadline)

 block/blk-mq.c         | 58 ++++++++++++++++++++++++++++++++++++++++++++------
 include/linux/blk-mq.h |  2 ++
 2 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b29e7dc7b309..bcb9f8742f72 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -904,6 +904,44 @@ static bool reorder_tags_to_front(struct list_head *list)
 	return first != NULL;
 }
 
+static int blk_mq_dispatch_wake(wait_queue_t *wait, unsigned mode, int flags,
+				void *key)
+{
+	struct blk_mq_hw_ctx *hctx;
+
+	hctx = container_of(wait, struct blk_mq_hw_ctx, dispatch_wait);
+
+	list_del(&wait->task_list);
+	clear_bit_unlock(BLK_MQ_S_TAG_WAITING, &hctx->state);
+	blk_mq_run_hw_queue(hctx, true);
+	return 1;
+}
+
+static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx)
+{
+	struct sbq_wait_state *ws;
+
+	/*
+	 * The TAG_WAITING bit serves as a lock protecting hctx->dispatch_wait.
+	 * The thread which wins the race to grab this bit adds the hardware
+	 * queue to the wait queue.
+	 */
+	if (test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state) ||
+	    test_and_set_bit_lock(BLK_MQ_S_TAG_WAITING, &hctx->state))
+		return false;
+
+	init_waitqueue_func_entry(&hctx->dispatch_wait, blk_mq_dispatch_wake);
+	ws = bt_wait_ptr(&hctx->tags->bitmap_tags, hctx);
+
+	/*
+	 * As soon as this returns, it's no longer safe to fiddle with
+	 * hctx->dispatch_wait, since a completion can wake up the wait queue
+	 * and unlock the bit.
+	 */
+	add_wait_queue(&ws->wait, &hctx->dispatch_wait);
+	return true;
+}
+
 bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 {
 	struct request_queue *q = hctx->queue;
@@ -931,15 +969,22 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 			continue;
 
 		/*
-		 * We failed getting a driver tag. Mark the queue(s)
-		 * as needing a restart. Retry getting a tag again,
-		 * in case the needed IO completed right before we
-		 * marked the queue as needing a restart.
+		 * The initial allocation attempt failed, so we need to
+		 * rerun the hardware queue when a tag is freed.
 		 */
-		blk_mq_sched_mark_restart(hctx);
-		if (!blk_mq_get_driver_tag(rq, &hctx, false))
+		if (blk_mq_dispatch_wait_add(hctx)) {
+			/*
+			 * It's possible that a tag was freed in the
+			 * window between the allocation failure and
+			 * adding the hardware queue to the wait queue.
+			 */
+			if (!blk_mq_get_driver_tag(rq, &hctx, false))
+				break;
+		} else {
 			break;
+		}
 	}
+
 	list_del_init(&rq->queuelist);
 	bd.rq = rq;
 
@@ -1051,6 +1096,7 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 {
 	if (unlikely(blk_mq_hctx_stopped(hctx) ||
+		     test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state) ||
 		     !blk_mq_hw_queue_mapped(hctx)))
 		return;
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8e4df3d6c8cd..001d30d727c5 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -33,6 +33,7 @@ struct blk_mq_hw_ctx {
 	struct blk_mq_ctx	**ctxs;
 	unsigned int		nr_ctx;
 
+	wait_queue_t		dispatch_wait;
 	atomic_t		wait_index;
 
 	struct blk_mq_tags	*tags;
@@ -160,6 +161,7 @@ enum {
 	BLK_MQ_S_STOPPED	= 0,
 	BLK_MQ_S_TAG_ACTIVE	= 1,
 	BLK_MQ_S_SCHED_RESTART	= 2,
+	BLK_MQ_S_TAG_WAITING	= 3,
 
 	BLK_MQ_MAX_DEPTH	= 10240,