From patchwork Wed Jan 25 17:42:24 2017
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 9537617
Subject: Re: [PATCH] queue stall with blk-mq-sched
To: Hannes Reinecke
References: <762cb508-1de0-93e2-5643-3fe946428eb5@fb.com>
 <8abc2430-e1fd-bece-ad52-c6d1d482c1e0@suse.de>
 <1663de5d-cdf7-a6ed-7539-c7d1f5e98f6c@fb.com>
 <717c595a-a3a6-0508-b537-8cf9e273271e@kernel.dk>
 <8178340b-dd64-c02d-0ef2-97ad5f928dc8@suse.de>
 <2b40b443-3bd6-717a-11ba-043886780adf@suse.de>
 <6035003f-029c-6cff-c35f-4e90496cab50@suse.de>
 <57539c5d-be3b-ab26-c6d4-a7ff554ded8b@suse.de>
 <3261ba64-cd7b-a7da-c407-c3b9828c3b57@kernel.dk>
Cc: "linux-block@vger.kernel.org", Omar Sandoval
From: Jens Axboe
Message-ID: <262c4739-be6c-94e9-8e8c-6e97a602e881@kernel.dk>
Date: Wed, 25 Jan 2017 10:42:24 -0700
In-Reply-To: <3261ba64-cd7b-a7da-c407-c3b9828c3b57@kernel.dk>
X-Mailing-List: linux-block@vger.kernel.org

On 01/25/2017 10:03 AM, Jens Axboe wrote:
> On 01/25/2017 09:57 AM, Hannes Reinecke wrote:
>> On 01/25/2017 04:52 PM, Jens Axboe wrote:
>>> On 01/25/2017 04:10 AM, Hannes Reinecke wrote:
>> [ .. ]
>>>> Bah.
>>>>
>>>> Not quite. I'm still seeing some queues with state 'restart'.
>>>>
>>>> I've found that I need another patch on top of that:
>>>>
>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>> index e872555..edcbb44 100644
>>>> --- a/block/blk-mq.c
>>>> +++ b/block/blk-mq.c
>>>> @@ -753,8 +754,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
>>>>
>>>>  	queue_for_each_hw_ctx(q, hctx, i) {
>>>>  		/* the hctx may be unmapped, so check it here */
>>>> -		if (blk_mq_hw_queue_mapped(hctx))
>>>> +		if (blk_mq_hw_queue_mapped(hctx)) {
>>>>  			blk_mq_tag_idle(hctx);
>>>> +			blk_mq_sched_restart(hctx);
>>>> +		}
>>>>  	}
>>>>  }
>>>>  	blk_queue_exit(q);
>>>>
>>>>
>>>> Reasoning is that in blk_mq_get_tag() we might end up scheduling the
>>>> request on another hctx, but the original hctx might still have the
>>>> SCHED_RESTART bit set.
>>>> Which will never be cleared as we complete the request on a different hctx,
>>>> so anything we do on the end_request side won't do us any good.
>>>
>>> I think you are right, it'll potentially trigger with shared tags and
>>> multiple hardware queues. I'll debug this today and come up with a
>>> decent fix.
>>>
>>> I committed the previous patch, fwiw.
>>>
>> THX.
>>
>> The above patch _does_ help in the sense that my testcase now completes
>> without stalls. And I even get decent performance with the mq-sched
>> fixes: 82k IOPS sequential read with mq-deadline as compared to 44k IOPS
>> when running without I/O scheduling.
>> Still some way off from the 132k IOPS I'm getting with CFQ, but we're
>> getting there.
>>
>> However, I do get a noticeable stall during the stonewall sequence
>> before the timeout handler kicks in, so there must be a better way of
>> handling this.
>>
>> But nevertheless, thanks for all your work here.
>> Very much appreciated.
>
> Yeah, the fix isn't really a fix, unless you are willing to tolerate
> potentially tens of seconds of extra latency until we idle it out :-)
>
> So we can't use the un-idling for this, but we can track it on the
> shared state, which is the tags. The problem isn't that we are
> switching to a new hardware queue, it's if we mark the hardware queue
> as restart AND it has nothing pending. In that case, we'll never
> get it restarted, since IO completion is what restarts it.
>
> I need to handle that case separately. Currently testing a patch, I
> should have something for you to test later today.

Can you try this one?

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index d05061f..6a1656d 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -300,6 +300,34 @@ bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx, struct request *rq)
 }
 EXPORT_SYMBOL_GPL(blk_mq_sched_bypass_insert);
 
+static void blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
+{
+	if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
+		clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
+		if (blk_mq_hctx_has_pending(hctx))
+			blk_mq_run_hw_queue(hctx, true);
+	}
+}
+
+void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx)
+{
+	unsigned int i;
+
+	if (!(hctx->flags & BLK_MQ_F_TAG_SHARED))
+		blk_mq_sched_restart_hctx(hctx);
+	else {
+		struct request_queue *q = hctx->queue;
+
+		if (!test_bit(QUEUE_FLAG_RESTART, &q->queue_flags))
+			return;
+
+		clear_bit(QUEUE_FLAG_RESTART, &q->queue_flags);
+
+		queue_for_each_hw_ctx(q, hctx, i)
+			blk_mq_sched_restart_hctx(hctx);
+	}
+}
+
 static void blk_mq_sched_free_tags(struct blk_mq_tag_set *set,
 				   struct blk_mq_hw_ctx *hctx,
 				   unsigned int hctx_idx)
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 6b465bc..becbc78 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -19,6 +19,7 @@ bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx, struct request *rq);
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio);
 bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio);
 bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq);
+void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
 void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx,
@@ -123,11 +124,6 @@ blk_mq_sched_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
 	BUG_ON(rq->internal_tag == -1);
 
 	blk_mq_put_tag(hctx, hctx->sched_tags, rq->mq_ctx, rq->internal_tag);
-
-	if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
-		clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
-		blk_mq_run_hw_queue(hctx, true);
-	}
 }
 
 static inline void blk_mq_sched_started_request(struct request *rq)
@@ -160,8 +156,15 @@ static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
 
 static inline void blk_mq_sched_mark_restart(struct blk_mq_hw_ctx *hctx)
 {
-	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
+	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
 		set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
+		if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
+			struct request_queue *q = hctx->queue;
+
+			if (!test_bit(QUEUE_FLAG_RESTART, &q->queue_flags))
+				set_bit(QUEUE_FLAG_RESTART, &q->queue_flags);
+		}
+	}
 }
 
 static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4c3e667..3951b72 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -40,7 +40,7 @@ static LIST_HEAD(all_q_list);
 /*
  * Check if any of the ctx's have pending work in this hardware queue
  */
-static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
+bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
 {
 	return sbitmap_any_bit_set(&hctx->ctx_map) ||
 			!list_empty_careful(&hctx->dispatch) ||
@@ -345,6 +345,7 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
 		blk_mq_sched_completed_request(hctx, rq);
+	blk_mq_sched_restart_queues(hctx);
 	blk_queue_exit(q);
 }
 
@@ -928,8 +929,16 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 		if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
 			if (!queued && reorder_tags_to_front(list))
 				continue;
+
+			/*
+			 * We failed getting a driver tag. Mark the queue(s)
+			 * as needing a restart. Retry getting a tag again,
+			 * in case the needed IO completed right before we
+			 * marked the queue as needing a restart.
+			 */
 			blk_mq_sched_mark_restart(hctx);
-			break;
+			if (!blk_mq_get_driver_tag(rq, &hctx, false))
+				break;
 		}
 
 		list_del_init(&rq->queuelist);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 6c24b90..077a400 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -33,6 +33,7 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
 bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *, struct list_head *);
 void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
+bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx);
 
 /*
  * Internal helpers for allocating/freeing the request map
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ee1fb59..40ce491 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -607,6 +607,7 @@ struct request_queue {
 #define QUEUE_FLAG_FLUSH_NQ    25	/* flush not queueuable */
 #define QUEUE_FLAG_DAX         26	/* device supports DAX */
 #define QUEUE_FLAG_STATS       27	/* track rq completion times */
+#define QUEUE_FLAG_RESTART     28
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_STACKABLE)	|	\
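
[Editor's note: for readers less familiar with blk-mq, below is a minimal userspace
sketch of the window that the retry in blk_mq_dispatch_rq_list() closes. This is not
kernel code; the names free_tags, restart_marked, get_driver_tag() and
complete_request() are invented for illustration. The point it models: if the
completion that frees the last shared tag runs between the failed tag allocation and
the setting of the restart mark, nothing will ever run the queue again, so the
dispatch path retries the allocation once after marking.]

/* Simplified model of the mark-restart-then-retry idea (not kernel code). */
#include <stdbool.h>
#include <stdio.h>

static int free_tags;        /* stands in for the shared driver tag pool */
static bool restart_marked;  /* stands in for SCHED_RESTART / QUEUE_FLAG_RESTART */

static bool get_driver_tag(void)
{
	if (free_tags > 0) {
		free_tags--;
		return true;
	}
	return false;
}

/* Completion side: free a tag, and rerun the queue only if it is marked. */
static void complete_request(void)
{
	free_tags++;
	if (restart_marked) {
		restart_marked = false;
		printf("completion restarts the queue\n");
	}
}

int main(void)
{
	bool got;

	free_tags = 0;

	got = get_driver_tag();      /* 1: dispatch fails to get a tag          */
	complete_request();          /* 2: last request completes, sees no mark */
	if (!got) {
		restart_marked = true;   /* 3: restart mark lands too late          */
		got = get_driver_tag();  /* retry from the patch: finds the freed tag */
	}
	printf("%s\n", got ? "dispatched after retry" : "queue stalled");
	return 0;
}

Without the retry, the sequence 1-2-3 leaves the mark set and no in-flight request to
clear it, which matches the stalls described above for shared tags and multiple
hardware queues.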