From patchwork Tue Apr 10 21:54:56 2018
X-Patchwork-Submitter: Tejun Heo
X-Patchwork-Id: 10333887
Date: Tue, 10 Apr 2018 14:54:56 -0700
From: "tj@kernel.org"
To: Bart Van Assche
Cc: "ming.lei@redhat.com", "hch@lst.de", "maxg@mellanox.com",
 "israelr@mellanox.com", "linux-block@vger.kernel.org",
 "stable@vger.kernel.org", "sagi@grimberg.me", "axboe@kernel.dk"
Subject: Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
Message-ID: <20180410215456.GD793541@devbig577.frc2.facebook.com>
References: <20180410135541.GA22340@ming.t460p>
 <7c4a8b019182d5a9259ef20e6462b3f6a533abed.camel@wdc.com>
 <20180410143007.GB22340@ming.t460p>
 <20180410152553.GC22340@ming.t460p>
 <20180410153031.GO3126663@devbig577.frc2.facebook.com>
 <20180410153812.GA3219@ming.t460p>
 <20180410154043.GP3126663@devbig577.frc2.facebook.com>
 <20180410213323.GC793541@devbig577.frc2.facebook.com>
X-Mailing-List: linux-block@vger.kernel.org

On Tue, Apr 10, 2018 at 09:46:23PM +0000, Bart Van Assche wrote:
> On Tue, 2018-04-10 at 14:33 -0700, tj@kernel.org wrote:
> > +	else
> > +		rq->missed_completion = true;
>
> In this patch I found code that sets rq->missed_completion but no code
> that clears it.  Did I perhaps miss something?

Ah, yeah, I was moving it out of add_timer but forgot to actually add it
to the issue path.  Fixed patch below.  BTW, no matter what we do with
the request handover between normal and timeout paths, we'd need
something similar.  Otherwise, while we can reduce the window, we can't
get rid of it.

Thanks.
---
 block/blk-mq.c         |   45 ++++++++++++++++++++++++++++++++-------------
 include/linux/blkdev.h |    2 ++
 2 files changed, 34 insertions(+), 13 deletions(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ
 	hctx_lock(hctx, &srcu_idx);
 	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
 		__blk_mq_complete_request(rq);
+	else
+		rq->missed_completion = true;
 	hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -684,6 +686,7 @@ void blk_mq_start_request(struct request
 
 	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
 	blk_add_timer(rq);
+	rq->missed_completion = false;
 
 	write_seqcount_end(&rq->gstate_seq);
 	preempt_enable();
@@ -881,7 +884,7 @@ static void blk_mq_check_expired(struct 
 	}
 }
 
-static void blk_mq_timeout_sync_rcu(struct request_queue *q)
+static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear)
 {
 	struct blk_mq_hw_ctx *hctx;
 	bool has_rcu = false;
@@ -896,7 +899,8 @@ static void blk_mq_timeout_sync_rcu(stru
 		else
 			synchronize_srcu(hctx->srcu);
 
-		hctx->need_sync_rcu = false;
+		if (clear)
+			hctx->need_sync_rcu = false;
 	}
 	if (has_rcu)
 		synchronize_rcu();
@@ -917,25 +921,37 @@ static void blk_mq_terminate_expired(str
 		blk_mq_rq_timed_out(hctx, rq, priv, reserved);
 }
 
-static void blk_mq_finish_timeout_reset(struct blk_mq_hw_ctx *hctx,
+static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
 	/*
 	 * @rq's timer reset has gone through rcu synchronization and is
 	 * visible now.  Allow normal completions again by resetting
 	 * ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
-	 * there's no memory ordering around ->aborted_gstate making it the
-	 * only field safe to update.  Let blk_add_timer() clear it later
-	 * when the request is recycled or times out again.
-	 *
-	 * As nothing prevents from completion happening while
-	 * ->aborted_gstate is set, this may lead to ignored completions
-	 * and further spurious timeouts.
+	 * blk_mq_timeout_reset_cleanup() needs it again and there's no
+	 * memory ordering around ->aborted_gstate making it the only field
+	 * safe to update.  Let blk_add_timer() clear it later when the
+	 * request is recycled or times out again.
 	 */
 	if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
 		blk_mq_rq_update_aborted_gstate(rq, 0);
 }
 
+static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx,
+		struct request *rq, void *priv, bool reserved)
+{
+	/*
+	 * @rq is now fully returned to the normal path.  If normal
+	 * completion raced timeout handling, execute the missed completion
+	 * here.  This is safe because 1. ->missed_completion can no longer
+	 * be asserted because nothing is timing out right now and 2. if
+	 * ->missed_completion is set, @rq is safe from recycling because
+	 * nobody could have completed it.
+	 */
+	if ((rq->rq_flags & RQF_MQ_TIMEOUT_RESET) && rq->missed_completion)
+		blk_mq_complete_request(rq);
+}
+
 static void blk_mq_timeout_work(struct work_struct *work)
 {
 	struct request_queue *q =
@@ -976,7 +992,7 @@ static void blk_mq_timeout_work(struct w
 		 * becomes a problem, we can add per-hw_ctx rcu_head and
 		 * wait in parallel.
 		 */
-		blk_mq_timeout_sync_rcu(q);
+		blk_mq_timeout_sync_rcu(q, true);
 
 		/* terminate the ones we won */
 		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired,
@@ -988,9 +1004,12 @@ static void blk_mq_timeout_work(struct w
 		 * reset racing against recycling.
 		 */
 		if (nr_resets) {
-			blk_mq_timeout_sync_rcu(q);
+			blk_mq_timeout_sync_rcu(q, false);
+			blk_mq_queue_tag_busy_iter(q,
+					blk_mq_timeout_reset_return, NULL);
+			blk_mq_timeout_sync_rcu(q, true);
 			blk_mq_queue_tag_busy_iter(q,
-					blk_mq_finish_timeout_reset, NULL);
+					blk_mq_timeout_reset_cleanup, NULL);
 		}
 	}
 
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -227,6 +227,8 @@ struct request {
 
 	unsigned int extra_len;	/* length of alignment and padding */
 
+	bool missed_completion;
+
 	/*
 	 * On blk-mq, the lower bits of ->gstate (generation number and
 	 * state) carry the MQ_RQ_* state value and the upper bits the