From patchwork Mon Jun 20 19:10:28 2011
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Russell King - ARM Linux <linux@arm.linux.org.uk>
X-Patchwork-Id: 898102
Received: from merlin.infradead.org (merlin.infradead.org [205.233.59.134])
	by demeter2.kernel.org (8.14.4/8.14.4) with ESMTP id p5KJBJo4023810
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO)
	for <patchwork-linux-arm@patchwork.kernel.org>;
	Mon, 20 Jun 2011 19:11:40 GMT
Received: from canuck.infradead.org ([2001:4978:20e::1])
	by merlin.infradead.org with esmtps (Exim 4.76 #1 (Red Hat Linux))
	id 1QYjsI-0007ro-Hu; Mon, 20 Jun 2011 19:11:07 +0000
Received: from localhost ([127.0.0.1] helo=canuck.infradead.org)
	by canuck.infradead.org with esmtp (Exim 4.76 #1 (Red Hat Linux))
	id 1QYjsH-00006Y-RE; Mon, 20 Jun 2011 19:11:06 +0000
Received: from [2002:4e20:1eda::1] (helo=caramon.arm.linux.org.uk)
	by canuck.infradead.org with esmtps (Exim 4.76 #1 (Red Hat Linux))
	id 1QYjs1-0008Ux-MI for linux-arm-kernel@lists.infradead.org;
	Mon, 20 Jun 2011 19:10:54 +0000
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=arm.linux.org.uk; s=caramon;
	h=Date:Sender:Message-Id:Content-Type:MIME-Version:Cc:Subject:Cc:To:From:References:In-Reply-To;
	bh=tO1pPuKfSym3FxvO7uIH7Jnk7z2jmj6BbxpnrtXR8Og=;
	b=hZakjAm5oG7aXov+d7BZKDNMNcvY0d+CJ5isKSceA0a/tcubHmnwSL7lg8i4hg0rWvXje2iawlXVBHOYvybKlRQEDuV6DenTK1QNbS/cOAzmXAVCu4vQZq5KgwpcfgSgEXm30OeIoDP9w+IoJI67XiBXXHNavYnZ8Fm8KL7XjLo=;
Received: from e0022681537dd.dyn.arm.linux.org.uk
	([2002:4e20:1eda:1:222:68ff:fe15:37dd] helo=rmk-PC.arm.linux.org.uk)
	by caramon.arm.linux.org.uk with esmtpsa (TLSv1:AES256-SHA:256)
	(Exim 4.72) (envelope-from <rmk@arm.linux.org.uk>)
	id 1QYjri-0001Xg-3V; Mon, 20 Jun 2011 20:10:30 +0100
Received: from rmk by rmk-PC.arm.linux.org.uk with local (Exim 4.76)
	(envelope-from <rmk@arm.linux.org.uk>)
	id 1QYjrg-000802-Ko; Mon, 20 Jun 2011 20:10:28 +0100
In-Reply-To: <20110620190945.GK26089@n2100.arm.linux.org.uk>
References: <20110620190945.GK26089@n2100.arm.linux.org.uk>
From: Russell King - ARM Linux <linux@arm.linux.org.uk>
To: Chris Ball <cjb@laptop.org>
Subject: [PATCH 2/3] MMC: improve error recovery from command channel errors
MIME-Version: 1.0
Content-Disposition: inline
Message-Id: <E1QYjrg-000802-Ko@rmk-PC.arm.linux.org.uk>
Date: Mon, 20 Jun 2011 20:10:28 +0100
X-CRM114-Version: 20090807-BlameThorstenAndJenny ( TRE 0.7.6 (BSD) )
	MR-646709E3
X-CRM114-CacheID: sfid-20110620_151050_875891_18670F4D 
X-CRM114-Status: GOOD (  32.70  )
X-Spam-Score: 1.2 (+)
X-Spam-Report: SpamAssassin version 3.3.1 on canuck.infradead.org summary:
	Content analysis details:   (1.2 points)
	pts rule name              description
	---- ----------------------
	--------------------------------------------------
	-0.1 DKIM_VALID_AU Message has a valid DKIM or DK signature from
	author's domain
	0.1 DKIM_SIGNED            Message has a DKIM or DK signature,
	not necessarily valid
	-0.1 DKIM_VALID Message has at least one valid DKIM or DK signature
	1.3 RDNS_NONE Delivered to internal network by a host with no rDNS
Cc: linux-mmc@vger.kernel.org, linux-arm-kernel@lists.infradead.org
X-BeenThere: linux-arm-kernel@lists.infradead.org
X-Mailman-Version: 2.1.12
Precedence: list
List-Id: <linux-arm-kernel.lists.infradead.org>
List-Unsubscribe: 
 <http://lists.infradead.org/mailman/options/linux-arm-kernel>,
	<mailto:linux-arm-kernel-request@lists.infradead.org?subject=unsubscribe>
List-Archive: <http://lists.infradead.org/pipermail/linux-arm-kernel/>
List-Post: <mailto:linux-arm-kernel@lists.infradead.org>
List-Help: <mailto:linux-arm-kernel-request@lists.infradead.org?subject=help>
List-Subscribe: 
 <http://lists.infradead.org/mailman/listinfo/linux-arm-kernel>,
	<mailto:linux-arm-kernel-request@lists.infradead.org?subject=subscribe>
Sender: linux-arm-kernel-bounces@lists.infradead.org
Errors-To: 
 linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org
X-Greylist: IP, sender and recipient auto-whitelisted, not delayed by
	milter-greylist-4.2.6 (demeter2.kernel.org [140.211.167.43]);
	Mon, 20 Jun 2011 19:11:40 +0000 (UTC)

Command channel errors fall into four classes:

1. The command was issued with the card in the wrong state
2. The command failed to be received by the card correctly
3. The cards response failed to be received by the host (CRC error)
4. The card failed to respond to the card

For (1), in theory we should know that the card is in the correct state.
However, a failed stop command (or other failure) may result in the card
remaining in a data transfer state from the previous command.  If we
detect this condition, we try to recover by sending a stop command.

For the initial commands (set block count and the read/write command)
no data will have been transferred.  All that we need deal with is
retrying at this point.  A failed stop command can be remedied as
above.

If we are unable to recover the card (eg, the card ignores our requests
for status, or we don't recognise the error code) then we immediately
fail the request.

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
---
 drivers/mmc/card/block.c |  230 ++++++++++++++++++++++++++++++++++++----------
 1 files changed, 182 insertions(+), 48 deletions(-)

diff --git a/drivers/mmc/card/block.c b/drivers/mmc/card/block.c
index 022edc3..aa074c8 100644
--- a/drivers/mmc/card/block.c
+++ b/drivers/mmc/card/block.c
@@ -525,6 +525,19 @@ static u32 mmc_sd_num_wr_blocks(struct mmc_card *card)
 	return result;
 }
 
+static int send_stop(struct mmc_card *card, u32 *status)
+{
+	struct mmc_command cmd = {0};
+	int err;
+
+	cmd.opcode = MMC_STOP_TRANSMISSION;
+	cmd.flags = MMC_RSP_SPI_R1B | MMC_RSP_R1B | MMC_CMD_AC;
+	err = mmc_wait_for_cmd(card->host, &cmd, 5);
+	if (err == 0)
+		*status = cmd.resp[0];
+	return err;
+}
+
 static int get_card_status(struct mmc_card *card, u32 *status, int retries)
 {
 	struct mmc_command cmd = {0};
@@ -540,6 +553,137 @@ static int get_card_status(struct mmc_card *card, u32 *status, int retries)
 	return err;
 }
 
+#define ERR_RETRY	2
+#define ERR_ABORT	1
+#define ERR_CONTINUE	0
+
+static int mmc_blk_cmd_error(struct request *req, const char *name, int error,
+	bool status_valid, u32 status)
+{
+	switch (error) {
+	case -EILSEQ:
+		/* response crc error, retry the r/w cmd */
+		pr_err("%s: %s sending %s command, card status %#x\n",
+			req->rq_disk->disk_name, "response CRC error",
+			name, status);
+		return ERR_RETRY;
+
+	case -ETIMEDOUT:
+		pr_err("%s: %s sending %s command, card status %#x\n",
+			req->rq_disk->disk_name, "timed out", name, status);
+
+		/* If the status cmd initially failed, retry the r/w cmd */
+		if (!status_valid)
+			return ERR_RETRY;
+
+		/*
+		 * If it was a r/w cmd crc error, or illegal command
+		 * (eg, issued in wrong state) then retry - we should
+		 * have corrected the state problem above.
+		 */
+		if (status & (R1_COM_CRC_ERROR | R1_ILLEGAL_COMMAND))
+			return ERR_RETRY;
+
+		/* Otherwise abort the command */
+		return ERR_ABORT;
+
+	default:
+		/* We don't understand the error code the driver gave us */
+		pr_err("%s: unknown error %d sending read/write command, card status %#x\n",
+		       req->rq_disk->disk_name, error, status);
+		return ERR_ABORT;
+	}
+}
+
+/*
+ * Initial r/w and stop cmd error recovery.
+ * We don't know whether the card received the r/w cmd or not, so try to
+ * restore things back to a sane state.  Essentially, we do this as follows:
+ * - Obtain card status.  If the first attempt to obtain card status fails,
+ *   the status word will reflect the failed status cmd, not the failed
+ *   r/w cmd.  If we fail to obtain card status, it suggests we can no
+ *   longer communicate with the card.
+ * - Check the card state.  If the card received the cmd but there was a
+ *   transient problem with the response, it might still be in a data transfer
+ *   mode.  Try to send it a stop command.  If this fails, we can't recover.
+ * - If the r/w cmd failed due to a response CRC error, it was probably
+ *   transient, so retry the cmd.
+ * - If the r/w cmd timed out, but we didn't get the r/w cmd status, retry.
+ * - If the r/w cmd timed out, and the r/w cmd failed due to CRC error or
+ *   illegal cmd, retry.
+ * Otherwise we don't understand what happened, so abort.
+ */
+static int mmc_blk_cmd_recovery(struct mmc_card *card, struct request *req,
+	struct mmc_blk_request *brq)
+{
+	bool prev_cmd_status_valid = true;
+	u32 status, stop_status = 0;
+	int err, retry;
+
+	/*
+	 * Try to get card status which indicates both the card state
+	 * and why there was no response.  If the first attempt fails,
+	 * we can't be sure the returned status is for the r/w command.
+	 */
+	for (retry = 2; retry >= 0; retry--) {
+		err = get_card_status(card, &status, 0);
+		if (!err)
+			break;
+
+		prev_cmd_status_valid = false;
+		pr_err("%s: error %d sending status command, %sing\n",
+		       req->rq_disk->disk_name, err, retry ? "retry" : "abort");
+	}
+
+	/* We couldn't get a response from the card.  Give up. */
+	if (err)
+		return ERR_ABORT;
+
+	/*
+	 * Check the current card state.  If it is in some data transfer
+	 * mode, tell it to stop (and hopefully transition back to TRAN.)
+	 */
+	if (R1_CURRENT_STATE(status) == R1_STATE_DATA ||
+	    R1_CURRENT_STATE(status) == R1_STATE_RCV) {
+		err = send_stop(card, &stop_status);
+		if (err)
+			pr_err("%s: error %d sending stop command\n",
+			       req->rq_disk->disk_name, err);
+
+		/*
+		 * If the stop cmd also timed out, the card is probably
+		 * not present, so abort.  Other errors are bad news too.
+		 */
+		if (err)
+			return ERR_ABORT;
+	}
+
+	/* Check for set block count errors */
+	if (brq->sbc.error)
+		return mmc_blk_cmd_error(req, "SET_BLOCK_COUNT", brq->sbc.error,
+				prev_cmd_status_valid, status);
+
+	/* Check for r/w command errors */
+	if (brq->cmd.error)
+		return mmc_blk_cmd_error(req, "r/w cmd", brq->cmd.error,
+				prev_cmd_status_valid, status);
+
+	/* Now for stop errors.  These aren't fatal to the transfer. */
+	pr_err("%s: error %d sending stop command, original cmd response %#x, card status %#x\n",
+	       req->rq_disk->disk_name, brq->stop.error,
+	       brq->cmd.resp[0], status);
+
+	/*
+	 * Subsitute in our own stop status as this will give the error
+	 * state which happened during the execution of the r/w command.
+	 */
+	if (stop_status) {
+		brq->stop.resp[0] = stop_status;
+		brq->stop.error = 0;
+	}
+	return ERR_CONTINUE;
+}
+
 static int mmc_blk_issue_discard_rq(struct mmc_queue *mq, struct request *req)
 {
 	struct mmc_blk_data *md = mq->data;
@@ -673,7 +817,7 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
 	struct mmc_blk_data *md = mq->data;
 	struct mmc_card *card = md->queue.card;
 	struct mmc_blk_request brq;
-	int ret = 1, disable_multi = 0;
+	int ret = 1, disable_multi = 0, retry = 0;
 
 	/*
 	 * Reliable writes are used to implement Forced Unit Access and
@@ -685,7 +829,7 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
 		(md->flags & MMC_BLK_REL_WR);
 
 	do {
-		u32 readcmd, writecmd, status = 0;
+		u32 readcmd, writecmd;
 
 		memset(&brq, 0, sizeof(struct mmc_blk_request));
 		brq.mrq.cmd = &brq.cmd;
@@ -802,55 +946,29 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
 		mmc_queue_bounce_post(mq);
 
 		/*
-		 * Check for errors here, but don't jump to cmd_err
-		 * until later as we need to wait for the card to leave
-		 * programming mode even when things go wrong.
+		 * sbc.error indicates a problem with the set block count
+		 * command.  No data will have been transferred.
+		 *
+		 * cmd.error indicates a problem with the r/w command.  No
+		 * data will have been transferred.
+		 *
+		 * stop.error indicates a problem with the stop command.  Data
+		 * may have been transferred, or may still be transferring.
 		 */
-		if (brq.sbc.error || brq.cmd.error ||
-		    brq.data.error || brq.stop.error) {
-			if (brq.data.blocks > 1 && rq_data_dir(req) == READ) {
-				/* Redo read one sector at a time */
-				printk(KERN_WARNING "%s: retrying using single "
-				       "block read\n", req->rq_disk->disk_name);
-				disable_multi = 1;
-				continue;
+		if (brq.sbc.error || brq.cmd.error || brq.stop.error) {
+			switch (mmc_blk_cmd_recovery(card, req, &brq)) {
+			case ERR_RETRY:
+				if (retry++ < 5)
+					continue;
+			case ERR_ABORT:
+				goto cmd_abort;
+			case ERR_CONTINUE:
+				break;
 			}
-			get_card_status(card, &status, 0);
-		}
-
-		if (brq.sbc.error) {
-			printk(KERN_ERR "%s: error %d sending SET_BLOCK_COUNT "
-			       "command, response %#x, card status %#x\n",
-			       req->rq_disk->disk_name, brq.sbc.error,
-			       brq.sbc.resp[0], status);
-		}
-
-		if (brq.cmd.error) {
-			printk(KERN_ERR "%s: error %d sending read/write "
-			       "command, response %#x, card status %#x\n",
-			       req->rq_disk->disk_name, brq.cmd.error,
-			       brq.cmd.resp[0], status);
-		}
-
-		if (brq.data.error) {
-			if (brq.data.error == -ETIMEDOUT && brq.mrq.stop)
-				/* 'Stop' response contains card status */
-				status = brq.mrq.stop->resp[0];
-			printk(KERN_ERR "%s: error %d transferring data,"
-			       " sector %u, nr %u, card status %#x\n",
-			       req->rq_disk->disk_name, brq.data.error,
-			       (unsigned)blk_rq_pos(req),
-			       (unsigned)blk_rq_sectors(req), status);
-		}
-
-		if (brq.stop.error) {
-			printk(KERN_ERR "%s: error %d sending stop command, "
-			       "response %#x, card status %#x\n",
-			       req->rq_disk->disk_name, brq.stop.error,
-			       brq.stop.resp[0], status);
 		}
 
 		if (!mmc_host_is_spi(card->host) && rq_data_dir(req) != READ) {
+			u32 status;
 			do {
 				int err = get_card_status(card, &status, 5);
 				if (err) {
@@ -867,8 +985,22 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
 				 (R1_CURRENT_STATE(status) == R1_STATE_PRG));
 		}
 
-		if (brq.cmd.error || brq.stop.error || brq.data.error) {
+		if (brq.data.error) {
+			pr_err("%s: error %d transferring data, sector %u nr %u, cmd response %#x card status %#x\n",
+				req->rq_disk->disk_name, brq.data.error,
+				(unsigned)blk_rq_pos(req),
+				(unsigned)blk_rq_sectors(req),
+				brq.cmd.resp[0], brq.stop.resp[0]);
+
 			if (rq_data_dir(req) == READ) {
+				if (brq.data.blocks > 1) {
+					/* Redo read one sector at a time */
+					pr_warning("%s: retrying using single block read\n",
+						req->rq_disk->disk_name);
+					disable_multi = 1;
+					continue;
+				}
+
 				/*
 				 * After an error, we redo I/O one sector at a
 				 * time, so we only reach here after trying to
@@ -878,8 +1010,9 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
 				ret = __blk_end_request(req, -EIO, brq.data.blksz);
 				spin_unlock_irq(&md->lock);
 				continue;
+			} else {
+				goto cmd_err;
 			}
-			goto cmd_err;
 		}
 
 		/*
@@ -916,6 +1049,7 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
 		spin_unlock_irq(&md->lock);
 	}
 
+ cmd_abort:
 	spin_lock_irq(&md->lock);
 	while (ret)
 		ret = __blk_end_request(req, -EIO, blk_rq_cur_bytes(req));