From patchwork Thu Jun  4 18:40:21 2015
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Rajat Jain <rajatja@google.com>
X-Patchwork-Id: 6549061
From: Rajat Jain <rajatja@google.com>
To: "James E.J. Bottomley" <JBottomley@odin.com>,
	linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: rajatxjain@gmail.com, Rajat Jain <rajatja@google.com>
Subject: [PATCH] scsi: Avoid potential infinite eh_timeout_handler() loop
Date: Thu,  4 Jun 2015 11:40:21 -0700
Message-Id: <1433443222-8260-1-git-send-email-rajatja@google.com>
X-Mailer: git-send-email 2.2.0.rc0.207.ga3a616c
Sender: linux-scsi-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-scsi.vger.kernel.org>
X-Mailing-List: linux-scsi@vger.kernel.org

Each cmd timeout should result in scmd->retries++. Currently that happens
only just before a command is requeued. However, it should also be
incremented when the LLD's eh_timed_out() handler asks for the timer to be
reset, because the LLD is then effectively given another full timeout
period (SD_TIMEOUT = 30 secs!) to attempt to complete the command.

Why this is a problem:

  => Currently the SCSI low level transport drivers can provide
     eh_timed_out() handlers (e.g. iscsi provides one) to deal with
     command timeouts.

  => The eh_timed_out() handler can return BLK_EH_RESET_TIMER, which
     causes the SCSI / block layer to reset the timer, thus giving more
     time to the LLD.

  => Currently an LLD can potentially loop infinitely on a command if it
     keeps returning BLK_EH_RESET_TIMER.

  => Other than choking its own devices, if the command that is stuck is a
     command issued during sd_probe_async() (e.g. a partition table scan),
     then it impacts all the disks because no other disks can be removed
     from the system until sd_probe_async() returns. (sd_remove waits on
     async_synchronize_full_domain(...))

  => We actually hit the situation described above: no disks in the system
     (even on other scsi hosts) could be removed, because a scsi command
     to read the partition table of an unrelated problematic disk was
     stuck during probe. The threads were stuck at:

	 schedule+0x312/0x7a0
	 async_synchronize_cookie_domain+0xb8/0x115
	 ? __wake_up_bit+0x40/0x40
	 async_synchronize_full_domain+0x15/0x17
	 sd_remove+0x5f/0x135
	 __device_release_driver+0x8a/0xe0
	 device_release_driver+0x23/0x30
	 bus_remove_device+0x10f/0x123
	 device_del+0x132/0x18e
	 __scsi_remove_device+0x56/0xb6
	 scsi_remove_device+0x26/0x33
	 scsi_remove_target+0x12d/0x1a0
	 ...

What this patch does:
  => Ensure that any requests to reset the timer are accounted for, so that
     there is a finite upper bound on the time that a command is tried.
     Once the allowed number of retries is exceeded, we proceed to the
     standard error handling procedure (abort etc.) by scheduling the
     command for EH.

Signed-off-by: Rajat Jain <rajatja@google.com>
---
 drivers/scsi/scsi_error.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index c95a4e9..9671ec5 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -283,6 +283,17 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
 	else if (host->hostt->eh_timed_out)
 		rtn = host->hostt->eh_timed_out(scmd);
 
+	/*
+	 * If a scmd times out because the LLD failed to complete it, make sure
+	 * the LLD can ask for more time only a finite number of times. Each
+	 * such request must also count towards the time the LLD has spent on
+	 * that cmd. Thus each timer reset requested by an LLD must be treated
+	 * as a retry, since it involves waiting for another whole timeout
+	 * period before the cmd times out again.
+	 */
+	if (rtn == BLK_EH_RESET_TIMER && (++scmd->retries > scmd->allowed))
+		rtn = BLK_EH_NOT_HANDLED;
+
 	if (rtn == BLK_EH_NOT_HANDLED) {
 		if (!host->hostt->no_async_abort &&
 		    scsi_abort_command(scmd) == SUCCESS)