From patchwork Sun Feb 18 13:11:44 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Tejun Heo <tj@kernel.org>
X-Patchwork-Id: 10226751
Date: Sun, 18 Feb 2018 05:11:44 -0800
From: "tj@kernel.org" <tj@kernel.org>
To: Bart Van Assche <Bart.VanAssche@wdc.com>
Cc: "hch@lst.de" <hch@lst.de>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"axboe@kernel.dk" <axboe@kernel.dk>
Subject: Re: [PATCH v2] blk-mq: Fix race between resetting the timer and
	completion handling
Message-ID: <20180218131144.GX695913@devbig577.frc2.facebook.com>
References: <1518024428.2870.35.camel@wdc.com>
	<20180207173531.GC695913@devbig577.frc2.facebook.com>
	<1518027251.2870.53.camel@wdc.com>
	<20180207200724.GD695913@devbig577.frc2.facebook.com>
	<1518047297.2870.80.camel@wdc.com>
	<1518052193.2870.90.camel@wdc.com>
	<20180208153940.GM695913@devbig577.frc2.facebook.com>
	<1518107501.3611.19.camel@wdc.com>
	<20180213212044.GS695913@devbig577.frc2.facebook.com>
	<1518627534.3147.6.camel@wdc.com>
Content-Disposition: inline
In-Reply-To: <1518627534.3147.6.camel@wdc.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-block-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-block.vger.kernel.org>
X-Mailing-List: linux-block@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

Hello, Bart.

On Wed, Feb 14, 2018 at 04:58:56PM +0000, Bart Van Assche wrote:
> With this patch applied the tests I ran so far pass.

Ah, great to hear.  Thanks a lot for testing.  Can you please test
the following too?  It's the same approach but batches the RCU syncs.
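
To illustrate the batching, here is a rough userspace sketch of what
blk_mq_timeout_sync_rcu() in the patch below does.  It is not kernel
code: struct sketch_hctx, struct sketch_sync_stats and
sketch_timeout_sync() are made-up names, and the counters only stand in
for synchronize_rcu() / synchronize_srcu() so that the number of waits
per pass can be observed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct blk_mq_hw_ctx. */
struct sketch_hctx {
	bool blocking;       /* models hctx->flags & BLK_MQ_F_BLOCKING */
	bool need_sync_rcu;  /* set while scanning, cleared by the sync pass */
};

/* Counters replacing the real grace-period waits (assumption). */
struct sketch_sync_stats {
	unsigned int rcu_syncs;   /* stands in for synchronize_rcu() */
	unsigned int srcu_syncs;  /* stands in for synchronize_srcu() */
};

/*
 * One batched pass: each flagged blocking context gets its own SRCU
 * wait, while all flagged non-blocking contexts share a single RCU wait.
 */
static void sketch_timeout_sync(struct sketch_hctx *hctxs, size_t n,
				struct sketch_sync_stats *stats)
{
	bool has_rcu = false;
	size_t i;

	assert(stats != NULL);
	assert(hctxs != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (!hctxs[i].need_sync_rcu)
			continue;

		if (!hctxs[i].blocking)
			has_rcu = true;
		else
			stats->srcu_syncs++;

		/* Clear the flag so a later pass doesn't wait again. */
		hctxs[i].need_sync_rcu = false;
	}
	if (has_rcu)
		stats->rcu_syncs++;
}
```

In this sketch, any number of flagged non-blocking contexts costs one
RCU wait per pass, and a pass with nothing flagged costs none.  The
patch calls the real helper once after marking expired requests and
once more only if some timers were reset.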

Thanks.

Index: work/block/blk-mq.c
===================================================================
--- work.orig/block/blk-mq.c
+++ work/block/blk-mq.c
@@ -816,7 +816,8 @@ struct blk_mq_timeout_data {
 	unsigned int nr_expired;
 };
 
-static void blk_mq_rq_timed_out(struct request *req, bool reserved)
+static void blk_mq_rq_timed_out(struct blk_mq_hw_ctx *hctx, struct request *req,
+				int *nr_resets, bool reserved)
 {
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
@@ -831,13 +832,10 @@ static void blk_mq_rq_timed_out(struct r
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
-		/*
-		 * As nothing prevents from completion happening while
-		 * ->aborted_gstate is set, this may lead to ignored
-		 * completions and further spurious timeouts.
-		 */
-		blk_mq_rq_update_aborted_gstate(req, 0);
 		blk_add_timer(req);
+		req->rq_flags |= RQF_MQ_TIMEOUT_RESET;
+		(*nr_resets)++;
+		hctx->need_sync_rcu = true;
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;
@@ -874,13 +872,34 @@ static void blk_mq_check_expired(struct
 	    time_after_eq(jiffies, deadline)) {
 		blk_mq_rq_update_aborted_gstate(rq, gstate);
 		data->nr_expired++;
-		hctx->nr_expired++;
+		hctx->need_sync_rcu = true;
 	} else if (!data->next_set || time_after(data->next, deadline)) {
 		data->next = deadline;
 		data->next_set = 1;
 	}
 }
 
+static void blk_mq_timeout_sync_rcu(struct request_queue *q)
+{
+	struct blk_mq_hw_ctx *hctx;
+	bool has_rcu = false;
+	int i;
+
+	queue_for_each_hw_ctx(q, hctx, i) {
+		if (!hctx->need_sync_rcu)
+			continue;
+
+		if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+			has_rcu = true;
+		else
+			synchronize_srcu(hctx->srcu);
+
+		hctx->need_sync_rcu = false;
+	}
+	if (has_rcu)
+		synchronize_rcu();
+}
+
 static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
@@ -893,7 +912,25 @@ static void blk_mq_terminate_expired(str
 	 */
 	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
 	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
-		blk_mq_rq_timed_out(rq, reserved);
+		blk_mq_rq_timed_out(hctx, rq, priv, reserved);
+}
+
+static void blk_mq_finish_timeout_reset(struct blk_mq_hw_ctx *hctx,
+		struct request *rq, void *priv, bool reserved)
+{
+	/*
+	 * @rq's timer reset has gone through rcu synchronization and is
+	 * visible now.  Allow normal completions again by resetting
+	 * ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
+	 * there's no memory barrier around ->aborted_gstate.  Let
+	 * blk_add_timer() clear it later.
+	 *
+	 * As nothing prevents a completion from happening while
+	 * ->aborted_gstate is set, this may lead to ignored completions
+	 * and further spurious timeouts.
+	 */
+	if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
+		blk_mq_rq_update_aborted_gstate(rq, 0);
 }
 
 static void blk_mq_timeout_work(struct work_struct *work)
@@ -928,7 +965,7 @@ static void blk_mq_timeout_work(struct w
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
 
 	if (data.nr_expired) {
-		bool has_rcu = false;
+		int nr_resets = 0;
 
 		/*
 		 * Wait till everyone sees ->aborted_gstate.  The
@@ -936,22 +973,22 @@ static void blk_mq_timeout_work(struct w
 		 * becomes a problem, we can add per-hw_ctx rcu_head and
 		 * wait in parallel.
 		 */
-		queue_for_each_hw_ctx(q, hctx, i) {
-			if (!hctx->nr_expired)
-				continue;
-
-			if (!(hctx->flags & BLK_MQ_F_BLOCKING))
-				has_rcu = true;
-			else
-				synchronize_srcu(hctx->srcu);
-
-			hctx->nr_expired = 0;
-		}
-		if (has_rcu)
-			synchronize_rcu();
+		blk_mq_timeout_sync_rcu(q);
 
 		/* terminate the ones we won */
-		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
+		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired,
+					   &nr_resets);
+
+		/*
+		 * For BLK_EH_RESET_TIMER, release the requests after
+		 * blk_add_timer() from above is visible to avoid timer
+		 * reset racing against recycling.
+		 */
+		if (nr_resets) {
+			blk_mq_timeout_sync_rcu(q);
+			blk_mq_queue_tag_busy_iter(q,
+					blk_mq_finish_timeout_reset, NULL);
+		}
 	}
 
 	if (data.next_set) {
Index: work/include/linux/blk-mq.h
===================================================================
--- work.orig/include/linux/blk-mq.h
+++ work/include/linux/blk-mq.h
@@ -51,7 +51,7 @@ struct blk_mq_hw_ctx {
 	unsigned int		queue_num;
 
 	atomic_t		nr_active;
-	unsigned int		nr_expired;
+	bool			need_sync_rcu;
 
 	struct hlist_node	cpuhp_dead;
 	struct kobject		kobj;
Index: work/block/blk-timeout.c
===================================================================
--- work.orig/block/blk-timeout.c
+++ work/block/blk-timeout.c
@@ -216,7 +216,7 @@ void blk_add_timer(struct request *req)
 		req->timeout = q->rq_timeout;
 
 	blk_rq_set_deadline(req, jiffies + req->timeout);
-	req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
+	req->rq_flags &= ~(RQF_MQ_TIMEOUT_EXPIRED | RQF_MQ_TIMEOUT_RESET);
 
 	/*
 	 * Only the non-mq case needs to add the request to a protected list.
Index: work/include/linux/blkdev.h
===================================================================
--- work.orig/include/linux/blkdev.h
+++ work/include/linux/blkdev.h
@@ -127,8 +127,10 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_ZONE_WRITE_LOCKED	((__force req_flags_t)(1 << 19))
 /* timeout is expired */
 #define RQF_MQ_TIMEOUT_EXPIRED	((__force req_flags_t)(1 << 20))
+/* timeout was reset (BLK_EH_RESET_TIMER) */
+#define RQF_MQ_TIMEOUT_RESET	((__force req_flags_t)(1 << 21))
 /* already slept for hybrid poll */
-#define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 21))
+#define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 22))
 
 /* flags that prevent us from merging requests: */
 #define RQF_NOMERGE_FLAGS \