From patchwork Tue Dec 20 02:02:27 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Mauricio Faria de Oliveira
 <mauricfo@linux.vnet.ibm.com>
X-Patchwork-Id: 9481045
Return-Path: <linux-scsi-owner@kernel.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
	[172.30.200.125])
	by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id
	64287606DB for <patchwork-linux-scsi@patchwork.kernel.org>;
	Tue, 20 Dec 2016 02:02:45 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 55910284E8
	for <patchwork-linux-scsi@patchwork.kernel.org>;
	Tue, 20 Dec 2016 02:02:45 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id 4A333284EC; Tue, 20 Dec 2016 02:02:45 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-6.9 required=2.0 tests=BAYES_00,RCVD_IN_DNSWL_HI
	autolearn=unavailable version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 6C017284E8
	for <patchwork-linux-scsi@patchwork.kernel.org>;
	Tue, 20 Dec 2016 02:02:44 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S932768AbcLTCCl (ORCPT
	<rfc822;patchwork-linux-scsi@patchwork.kernel.org>);
	Mon, 19 Dec 2016 21:02:41 -0500
Received: from mx0a-001b2d01.pphosted.com ([148.163.156.1]:52846 "EHLO
	mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S932969AbcLTCCk (ORCPT
	<rfc822;linux-scsi@vger.kernel.org>);
	Mon, 19 Dec 2016 21:02:40 -0500
Received: from pps.filterd (m0098393.ppops.net [127.0.0.1])
	by mx0a-001b2d01.pphosted.com (8.16.0.17/8.16.0.17) with SMTP id
	uBK1wopi123268
	for <linux-scsi@vger.kernel.org>; Mon, 19 Dec 2016 21:02:40 -0500
Received: from e24smtp01.br.ibm.com (e24smtp01.br.ibm.com [32.104.18.85])
	by mx0a-001b2d01.pphosted.com with ESMTP id 27emteve1r-1
	(version=TLSv1.2 cipher=AES256-SHA bits=256 verify=NOT)
	for <linux-scsi@vger.kernel.org>; Mon, 19 Dec 2016 21:02:39 -0500
Received: from localhost
	by e24smtp01.br.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use
	Only! Violators will be prosecuted
	for <linux-scsi@vger.kernel.org> from <mauricfo@linux.vnet.ibm.com>;
	Tue, 20 Dec 2016 00:02:36 -0200
Received: from d24dlp02.br.ibm.com (9.18.248.206)
	by e24smtp01.br.ibm.com (10.172.0.143) with IBM ESMTP SMTP Gateway:
	Authorized Use Only! Violators will be prosecuted;
	Tue, 20 Dec 2016 00:02:33 -0200
Received: from d24relay01.br.ibm.com (d24relay01.br.ibm.com [9.8.31.16])
	by d24dlp02.br.ibm.com (Postfix) with ESMTP id 71E371DC0054;
	Mon, 19 Dec 2016 21:02:33 -0500 (EST)
Received: from d24av05.br.ibm.com (d24av05.br.ibm.com [9.18.232.44])
	by d24relay01.br.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id
	uBK22WnF5087400; Tue, 20 Dec 2016 00:02:32 -0200
Received: from d24av05.br.ibm.com (localhost [127.0.0.1])
	by d24av05.br.ibm.com (8.14.4/8.14.4/NCO v10.0 AVout) with ESMTP id
	uBK22W4r015302; Tue, 20 Dec 2016 00:02:32 -0200
Received: from t440.ibm.com ([9.80.192.6])
	by d24av05.br.ibm.com (8.14.4/8.14.4/NCO v10.0 AVin) with ESMTP id
	uBK22SdF015240; Tue, 20 Dec 2016 00:02:29 -0200
From: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
To: jejb@linux.vnet.ibm.com, martin.petersen@oracle.com,
	linux-scsi@vger.kernel.org
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	gpiccoli@linux.vnet.ibm.com
Subject: [PATCH] scsi: do not requeue requests unaligned with device sector
	size
Date: Tue, 20 Dec 2016 00:02:27 -0200
X-Mailer: git-send-email 1.8.3.1
MIME-Version: 1.0
X-TM-AS-MML: disable
X-Content-Scanned: Fidelis XPS MAILER
x-cbid: 16122002-1523-0000-0000-0000025B23B3
X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused
x-cbparentid: 16122002-1524-0000-0000-000029E63BF1
Message-Id: <1482199347-9128-1-git-send-email-mauricfo@linux.vnet.ibm.com>
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:, ,
	definitions=2016-12-19_14:, , signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
	spamscore=0 suspectscore=0
	malwarescore=0 phishscore=0 adultscore=0 bulkscore=0 classifier=spam
	adjust=0 reason=mlx scancount=1 engine=8.0.1-1612050000
	definitions=main-1612200021
Sender: linux-scsi-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-scsi.vger.kernel.org>
X-Mailing-List: linux-scsi@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

When a SCSI command (e.g., read operation) is partially completed
with good status and residual bytes (i.e., not all the bytes from
the specified transfer length were transferred) the SCSI midlayer
will update the request/bios with the completed bytes and requeue
the request in order to complete the remainder/pending bytes.

However, when the device sector size is greater than the 512-byte
default/kernel sector size, alignment restrictions and validation
apply (both to the starting logical block address, and the number
of logical blocks to transfer) -- values must be multiples of the
device sector size, otherwise the kernel fails the request in the
preparation stage (e.g., sd_setup_read_write_cmnd() at sd.c file):

  [...] sd 0:0:0:0: [sda] Bad block number requested

Hence, this error message (and the respective failed request) can
be observed on devices with larger sector sizes which may respond
the SCSI command with a SCSI residual size that is unaligned with
the device sector size -- because the requeued request's starting
logical block and number of logical blocks are based on the value
of the remainder/pending bytes.

In order to address this problem, introduce a check for this case
in scsi_io_completion() (before it calls scsi_end_request() which
in turn calls blk_update_request() which is the site that changes
the request's __sector and __data_len fields, which are validated
by sd_setup_read_write_cmnd()).

This check verifies whether there is any residual/remainder bytes
in the (potentially partially) completed requested and calculates
the correctly aligned values for the number of completed bytes to
pass up to scsi_end_request()/blk_update_request() that guarantee
that the requeued request is aligned with the device sector size.

The corner case is when one sector is requested and the response
is partially complete, for which the remainder/pending bytes are
unaligned and no further request would be valid.  On such a case,
the original request is retried after a delay (in case the error
is hopefully due to a temporary condition in the device), but up
to the retry limit (in case the condition is permanent, e.g. bad
sector in the medium), after which the request is finally failed.

In order to reproduce and verify this problem, the virtio_scsi.c
file can be modified to respond to 3 'special' SCSI commands, on
which partial completions are introduced (described in the patch).

This is the guest's disk test image and libvirt XML snippets for
the 4k-sector disk using the virtio scsi driver:

    # qemu-img create -f qcow2 /var/lib/libvirt/images/test.qcow2 128G

    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/test.qcow2'/>
      <blockio logical_block_size='4096' physical_block_size='4096'/>
      <target dev='sda' bus='scsi'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>

    <controller type='scsi' index='0' model='virtio-scsi'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05'
function='0x0'/>
    </controller>

And the verification in the guest:

    # cat /sys/block/sda/queue/physical_block_size
    4096

    # cat /sys/block/sda/queue/hw_sector_size
    4096

This is the patch to virtio_scsi.c (lines prefixed with ' ___ '):

 ___ --- a/drivers/scsi/virtio_scsi.c
 ___ +++ b/drivers/scsi/virtio_scsi.c
 ___ @@ -153,11 +153,45 @@
 ___  	struct virtio_scsi_cmd_resp *resp = &cmd->resp.cmd;
 ___  	struct virtio_scsi_target_state *tgt =
 ___  				scsi_target(sc->device)->hostdata;
 ___ +	static int debug_failures = 0;
 ___
 ___  	dev_dbg(&sc->device->sdev_gendev,
 ___  		"cmd %p response %u status %#02x sense_len %u\n",
 ___  		sc, resp->response, resp->status, resp->sense_len);
 ___
 ___ +	// DEBUG: filter this CDB for testing purposes.
 ___ +	// CDB: Read(10) 28 00 01 02 03 xx 00 00 yy 00
 ___ +	// (xx: LSB of the LBA, and yy: LSB of the LEN)
 ___ +	if ((sc->cmnd[0] == 0x28) && (sc->cmnd[1] == 0x00)
 ___ +	&&  (sc->cmnd[2] == 0x01) && (sc->cmnd[3] == 0x02)
 ___ +	&&  (sc->cmnd[4] == 0x03) && (sc->cmnd[6] == 0x00)
 ___ +	&&  (sc->cmnd[7] == 0x00) && (sc->cmnd[9] == 0x00)) {
 ___ +
 ___ +		// Test 1:
 ___ +		// - LBA: 01 02 03 _04_
 ___ +		// - LEN: two sectors (2 * 4k = 8k)
 ___ +		// - Action: complete 5k out of 8k (3k residual)
 ___ +		if ((sc->cmnd[5] == 0x04) && (sc->cmnd[8] == 0x02))
 ___ +			resp->resid = 6 * 512;
 ___ +
 ___ +		// Test 2:
 ___ +		// - LBA: 01 02 03 _04_
 ___ +		// - LEN: one sector (1 * 4k = 4k)
 ___ +		// - Action: complete 3k out of 4k (1k residual)
 ___ +		//           always.
 ___ +		if ((sc->cmnd[5] == 0x04) && (sc->cmnd[8] == 0x01))
 ___ +			resp->resid = 2 * 512;
 ___ +
 ___ +		// Test 3:
 ___ +		// - LBA: 01 02 03 _08_
 ___ +		// - LEN: one sector (1 * 4k = 4k)
 ___ +		// - Action: complete 3k out of 4k (1k residual)
 ___ +		//           but on every 4th attempt (complete 4k)
 ___ +		if ((sc->cmnd[5] == 0x08) &&  (sc->cmnd[8] == 0x01)
 ___ +		&&  (++debug_failures % 4 != 0))
 ___ +			resp->resid = 2 * 512;
 ___ +	}
 ___ +
 ___  	sc->result = resp->status;
 ___  	virtscsi_compute_resid(sc, virtio32_to_cpu(vscsi->vdev,
 ___ resp->resid));
 ___  	switch (resp->response) {

Without this patch all the 3 tests fail w/ 'Bad block number requested'.

  Test 1)

  # dd if=/dev/sda of=/dev/null bs=8192 count=1 skip=8454530 iflag=direct
  dd: error reading ‘/dev/sda’: Input/output error
  0+0 records in
  0+0 records out
  0 bytes (0 B) copied, 0.00260138 s, 0.0 kB/s

  # dmesg
  [...] sd 0:0:0:0: [sda] Bad block number requested

And if you look with more detail using scsi_logging_level (HLQUEUE = 2
and HLCOMPLETE = 2) the unaligned requeued request is clearly visible
(on a 4k-sector device, block/count must be multiples of 8 512-byte sectors)
then it fails:

    # echo $(( (2<<21) + (2<<24) )) > /proc/sys/dev/scsi/logging_level;

    Test 1)

    # dmesg -c >/dev/null; dd if=/dev/sda of=/dev/null bs=8192 count=1
      skip=8454530 iflag=direct 2>/dev/null; dmesg
    [...] sd 0:0:0:0: [sda] sd_setup_read_write_cmnd: block=135272480, count=16
    [...] sd 0:0:0:0: [sda] tag#0 sd_done: completed 8192 of 8192 bytes
    [...] sd 0:0:0:0: [sda] tag#0 16 sectors total, 5120 bytes done.
    [...] sd 0:0:0:0: [sda] sd_setup_read_write_cmnd: block=135272490, count=6
    [...] sd 0:0:0:0: [sda] Bad block number requested
    [...] sd 0:0:0:0: tag#0 0 sectors total, 0 bytes done.

With the patch, this happily changes into another 4k-sized request:

    # dmesg -c >/dev/null; dd if=/dev/sda of=/dev/null bs=8192 count=1
      skip=8454530 iflag=direct 2>/dev/null; dmesg
    [...] sd 0:0:0:0: [sda] sd_setup_read_write_cmnd: block=135272480, count=16
    [...] sd 0:0:0:0: [sda] tag#0 sd_done: completed 8192 of 8192 bytes
    [...] sd 0:0:0:0: [sda] tag#0 16 sectors total, 5120 bytes done.
    [...] sd 0:0:0:0: [sda] tag#0 checking 5120 bytes for alignment
                            (sector size 4096, remainder 1024, resid 3072)
    [...] sd 0:0:0:0: [sda] sd_setup_read_write_cmnd: block=135272488, count=8
    [...] sd 0:0:0:0: [sda] tag#0 sd_done: completed 4096 of 4096 bytes
    [...] sd 0:0:0:0: [sda] tag#0 8 sectors total, 4096 bytes done.
    [...] sd 0:0:0:0: tag#0 0 sectors total, 0 bytes done.

And verifying the one sector corner case in which all retries are failed
by the driver (when we must finally fail the request up the stack), we
observe the original request plus five retries, then the I/O error occurs:

    Test 2)

    # dmesg -c >/dev/null; dd if=/dev/sda of=/dev/null bs=4096 count=1
      skip=16909060 iflag=direct 2>/dev/null; dmesg
    [...] sd 0:0:0:0: [sda] sd_setup_read_write_cmnd: block=135272480, count=8
    [...] sd 0:0:0:0: [sda] tag#0 sd_done: completed 4096 of 4096 bytes
    [...] sd 0:0:0:0: [sda] tag#0 8 sectors total, 3072 bytes done.
    [...] sd 0:0:0:0: [sda] tag#0 checking 3072 bytes for alignment
                            (sector size 4096, remainder 3072, resid 1024)
    [...] sd 0:0:0:0: [sda] tag#0 sd_done: completed 4096 of 4096 bytes
    [...] sd 0:0:0:0: [sda] tag#0 8 sectors total, 3072 bytes done.
    [...] sd 0:0:0:0: [sda] tag#0 checking 3072 bytes for alignment
                            (sector size 4096, remainder 3072, resid 1024)
    [...] sd 0:0:0:0: [sda] tag#0 sd_done: completed 4096 of 4096 bytes
    [...] sd 0:0:0:0: [sda] tag#0 8 sectors total, 3072 bytes done.
    [...] sd 0:0:0:0: [sda] tag#0 checking 3072 bytes for alignment
                            (sector size 4096, remainder 3072, resid 1024)
    [...] sd 0:0:0:0: [sda] tag#0 sd_done: completed 4096 of 4096 bytes
    [...] sd 0:0:0:0: [sda] tag#0 8 sectors total, 3072 bytes done.
    [...] sd 0:0:0:0: [sda] tag#0 checking 3072 bytes for alignment
                            (sector size 4096, remainder 3072, resid 1024)
    [...] sd 0:0:0:0: [sda] tag#0 sd_done: completed 4096 of 4096 bytes
    [...] sd 0:0:0:0: [sda] tag#0 8 sectors total, 3072 bytes done.
    [...] sd 0:0:0:0: [sda] tag#0 checking 3072 bytes for alignment
                            (sector size 4096, remainder 3072, resid 1024)
    [...] sd 0:0:0:0: [sda] tag#0 sd_done: completed 4096 of 4096 bytes
    [...] sd 0:0:0:0: [sda] tag#0 8 sectors total, 3072 bytes done.
    [...] sd 0:0:0:0: [sda] tag#0 checking 3072 bytes for alignment
                            (sector size 4096, remainder 3072, resid 1024)
    [...] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
    [...] sd 0:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 01 02 03 04 00 00 01 00
    [...] blk_update_request: I/O error, dev sda, sector 135272480
    [...] sd 0:0:0:0: tag#0 0 sectors total, 0 bytes done.

And now the one sector corner case when eventually one retry succeeds
(the original request plus three retries, of which the last retry passes)
and the request is successfully passed up the stack and finished:

    Test 3)

    # dmesg -c >/dev/null; dd if=/dev/sda of=/dev/null bs=4096 count=1
      skip=16909064 iflag=direct 2>/dev/null; dmesg
    [...] sd 0:0:0:0: [sda] sd_setup_read_write_cmnd: block=135272512, count=8
    [...] sd 0:0:0:0: [sda] tag#0 sd_done: completed 4096 of 4096 bytes
    [...] sd 0:0:0:0: [sda] tag#0 8 sectors total, 3072 bytes done.
    [...] sd 0:0:0:0: [sda] tag#0 checking 3072 bytes for alignment
                      (sector size 4096, remainder 3072, resid 1024)
    [...] sd 0:0:0:0: [sda] tag#0 sd_done: completed 4096 of 4096 bytes
    [...] sd 0:0:0:0: [sda] tag#0 8 sectors total, 3072 bytes done.
    [...] sd 0:0:0:0: [sda] tag#0 checking 3072 bytes for alignment
                      (sector size 4096, remainder 3072, resid 1024)
    [...] sd 0:0:0:0: [sda] tag#0 sd_done: completed 4096 of 4096 bytes
    [...] sd 0:0:0:0: [sda] tag#0 8 sectors total, 3072 bytes done.
    [...] sd 0:0:0:0: [sda] tag#0 checking 3072 bytes for alignment
                      (sector size 4096, remainder 3072, resid 1024)
    [...] sd 0:0:0:0: [sda] tag#0 sd_done: completed 4096 of 4096 bytes
    [...] sd 0:0:0:0: [sda] tag#0 8 sectors total, 4096 bytes done.
    [...] sd 0:0:0:0: tag#0 0 sectors total, 0 bytes done.

Apologies for the ridiculously long commit message with description and
test-cases, but this problem has been relatively difficult to reproduce
and understand, so I thought the documentation/instructions for working
with that could be helpful for others too.

Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Tested-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
---
 drivers/scsi/scsi_lib.c | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 2cca9cffc63f..190ab28cfb2d 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -821,6 +821,45 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 	}
 
 	/*
+	 * If the SCSI command is successful but has residual bytes,
+	 * align good_bytes to the device sector size to ensure the
+	 * requeued request (to complete the remainder transfer) is
+	 * also aligned and does not fail alignment/size validation.
+	 */
+	if (unlikely(!result && scsi_get_resid(cmd) &&
+		     req->cmd_type == REQ_TYPE_FS)) {
+
+		unsigned int sector_size = cmd->device->sector_size;
+		unsigned int remainder = good_bytes % sector_size;
+		int resid = scsi_get_resid(cmd);
+
+		SCSI_LOG_HLCOMPLETE(1, scmd_printk(KERN_INFO, cmd,
+			"checking %d bytes for alignment "
+			"(sector size %u, remainder %u, resid %d)\n",
+			good_bytes, sector_size, remainder, resid));
+
+		if (remainder) {
+			good_bytes -= remainder;
+			resid += remainder;
+			scsi_set_resid(cmd, resid);
+
+			/*
+			 * If less than one device sector has completed, retry the
+			 * request after delay (up to the retry limit) or timeout.
+			 */
+			if (!good_bytes) {
+				if ((++cmd->retries) <= cmd->allowed
+				&& !scsi_noretry_cmd(cmd)) {
+					goto delayed_retry;
+				} else {
+					set_host_byte(cmd, DID_TIME_OUT);
+					goto error;
+				}
+			}
+		}
+	}
+
+	/*
 	 * special case: failed zero length commands always need to
 	 * drop down into the retry code. Otherwise, if we finished
 	 * all bytes in the request we are done now.
@@ -845,6 +884,7 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 	if (result == 0)
 		goto requeue;
 
+	error:
 	error = __scsi_error_from_host_byte(cmd, result);
 
 	if (host_byte(result) == DID_RESET) {
@@ -985,6 +1025,7 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 		__scsi_queue_insert(cmd, SCSI_MLQUEUE_EH_RETRY, 0);
 		break;
 	case ACTION_DELAYED_RETRY:
+	delayed_retry:
 		/* Retry the same command after a delay */
 		__scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY, 0);
 		break;