From patchwork Tue Sep 19 22:24:33 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Coly Li <colyli@suse.de>
X-Patchwork-Id: 9960403
Return-Path: <linux-block-owner@kernel.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
	[172.30.200.125])
	by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id
	626EA6038F for <patchwork-linux-block@patchwork.kernel.org>;
	Tue, 19 Sep 2017 22:25:03 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 4B3C328179
	for <patchwork-linux-block@patchwork.kernel.org>;
	Tue, 19 Sep 2017 22:25:03 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id 3E94C283ED; Tue, 19 Sep 2017 22:25:03 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-6.9 required=2.0 tests=BAYES_00,RCVD_IN_DNSWL_HI
	autolearn=ham version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 6F4C328179
	for <patchwork-linux-block@patchwork.kernel.org>;
	Tue, 19 Sep 2017 22:25:02 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751512AbdISWZA (ORCPT
	<rfc822;patchwork-linux-block@patchwork.kernel.org>);
	Tue, 19 Sep 2017 18:25:00 -0400
Received: from mx2.suse.de ([195.135.220.15]:42508 "EHLO mx1.suse.de"
	rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP
	id S1751361AbdISWY7 (ORCPT <rfc822;linux-block@vger.kernel.org>);
	Tue, 19 Sep 2017 18:24:59 -0400
X-Virus-Scanned: by amavisd-new at test-mx.suse.de
Received: from relay2.suse.de (charybdis-ext.suse.de [195.135.220.254])
	by mx1.suse.de (Postfix) with ESMTP id 322E15CB35;
	Tue, 19 Sep 2017 22:24:58 +0000 (UTC)
From: Coly Li <colyli@suse.de>
To: linux-bcache@vger.kernel.org, linux-block@vger.kernel.org
Cc: Coly Li <colyli@suse.de>, Nix <nix@esperi.org.uk>,
	Kai Krakow <hurikhan77@gmail.com>,
	Eric Wheeler <bcache@lists.ewheeler.net>,
	Junhui Tang <tang.junhui@zte.com.cn>, stable@vger.kernel.org
Subject: [PATCHv2] bcache: option to allow stale data on read failure
Date: Wed, 20 Sep 2017 06:24:33 +0800
Message-Id: <20170919222433.24336-1-colyli@suse.de>
X-Mailer: git-send-email 2.13.5
Sender: linux-block-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-block.vger.kernel.org>
X-Mailing-List: linux-block@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

When bcache serves read I/O, for example in writeback or writethrough mode,
and a read request on the cache device fails, bcache tries to recover the
request by reading from the cached (backing) device. If the data on the
cached device is not in sync with the cache device, the requester receives
stale data.

For critical storage systems such as databases, returning stale data from
recovery may result in application-level data corruption, which is
unacceptable. In other situations, such as a multimedia stream cache,
continuous service may be more important, and fetching a chunk of stale
data is acceptable.

This patch resolves the conflict by adding a sysfs option
	/sys/block/bcache<idx>/bcache/allow_stale_data_on_failure
which is cleared (set to 0, i.e. disabled) by default. Users can now choose
the behavior that suits their situation.
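
As an illustration of how the option might be toggled from user space, here
is a minimal sketch in C. The helper name set_bool_attr() is hypothetical and
not part of this patch; it only assumes that the attribute accepts "0" or "1"
written as text, which is how the other boolean bcache attributes behave.

```c
#include <stdio.h>

/*
 * Hypothetical helper, not part of the patch: write "1" or "0" to a
 * sysfs-style attribute file. Returns 0 on success, -1 on failure.
 */
static int set_bool_attr(const char *path, int enable)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;

	ok = fputs(enable ? "1\n" : "0\n", f) >= 0;

	/* fclose() flushes the buffer, so a failed write can surface here. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

A caller would pass the attribute path for its own device, for example
"/sys/block/bcache0/bcache/allow_stale_data_on_failure", where bcache0 is
only an example index.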

With this patch, a failed but recoverable read request in writeback or
writethrough mode is recovered only under one of the following conditions:
 - dc->has_dirty is zero. All data on the cache device is synced to the
   cached device, so the recovered data is up to date.
 - dc->has_dirty is non-zero and dc->allow_stale_data_on_failure is set
   to 1. There is dirty data not yet synced to the cached device, but
   because allow_stale_data_on_failure is set, receiving stale data is
   explicitly acceptable to the requester.
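
The two conditions above can be sketched as a standalone predicate. This is a
simplified user-space model, not the kernel code: struct dev_state and
may_retry_from_backing() are hypothetical names, and has_dirty is a plain int
here rather than the atomic_t read with atomic_read() in the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the struct cached_dev fields used here. */
struct dev_state {
	int  has_dirty;                    /* non-zero: cache holds unsynced data */
	bool allow_stale_data_on_failure;  /* mirrors the sysfs option */
};

/*
 * Return true if a failed read may be retried from the cached (backing)
 * device: the request must be recoverable, and either nothing is dirty
 * or stale data has been explicitly allowed.
 */
static bool may_retry_from_backing(bool recoverable,
				   const struct dev_state *dc)
{
	if (!recoverable)
		return false;

	return !dc->has_dirty || dc->allow_stale_data_on_failure;
}
```

With the option left at its default of 0, a dirty device fails the check and
the read error is returned to the requester instead of stale data.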

In the other bcache cache modes, read requests never reach
cached_dev_read_error(), so they do not need this patch.

Please note that because the cache mode can be switched arbitrarily at run
time, a device in writethrough mode may previously have been in writeback
mode. Checking dc->has_dirty in writethrough mode therefore still makes
sense.

Changelog:
v2: rename sysfs entry from allow_stale_data_on_failure  to
    allow_stale_data_on_failure, and fix the confusing commit log.
v1: initial patch posted.

Signed-off-by: Coly Li <colyli@suse.de>
Reported-by: Arne Wolf <awolf@lenovo.com>
Cc: Nix <nix@esperi.org.uk>
Cc: Kai Krakow <hurikhan77@gmail.com>
Cc: Eric Wheeler <bcache@lists.ewheeler.net>
Cc: Junhui Tang <tang.junhui@zte.com.cn>
Cc: stable@vger.kernel.org
---
 drivers/md/bcache/bcache.h  |  1 +
 drivers/md/bcache/request.c | 14 +++++++++++++-
 drivers/md/bcache/sysfs.c   |  4 ++++
 3 files changed, 18 insertions(+), 1 deletion(-)
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index dee542fff68e..f26b174f409a 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -356,6 +356,7 @@ struct cached_dev {
 	unsigned		partial_stripes_expensive:1;
 	unsigned		writeback_metadata:1;
 	unsigned		writeback_running:1;
+	unsigned		allow_stale_data_on_failure:1;
 	unsigned char		writeback_percent;
 	unsigned		writeback_delay;
 
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 019b3df9f1c6..becbc0959ca2 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -702,8 +702,20 @@ static void cached_dev_read_error(struct closure *cl)
 {
 	struct search *s = container_of(cl, struct search, cl);
 	struct bio *bio = &s->bio.bio;
+	struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
+	int recovery_stale_data = dc ? dc->allow_stale_data_on_failure : 0;
 
-	if (s->recoverable) {
+	/*
+	 * If dc->has_dirty is non-zero and the data being recovered is on
+	 * the cache device, recovering from the cached device returns stale
+	 * data to the requester. In some cases stale data is preferable to
+	 * -EIO, so I/O error recovery only happens when:
+	 * - There is no dirty data on the cache device, or
+	 * - The cached device is dirty but sysfs allow_stale_data_on_failure
+	 *   is explicitly set (to 1) to accept stale data from recovery.
+	 */
+	if (s->recoverable &&
+	    (!atomic_read(&dc->has_dirty) || recovery_stale_data)) {
 		/* Retry from the backing device: */
 		trace_bcache_read_retry(s->orig_bio);
 
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index f90f13616980..8603756005a8 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -106,6 +106,7 @@ rw_attribute(cache_replacement_policy);
 rw_attribute(btree_shrinker_disabled);
 rw_attribute(copy_gc_enabled);
 rw_attribute(size);
+rw_attribute(allow_stale_data_on_failure);
 
 SHOW(__bch_cached_dev)
 {
@@ -125,6 +126,7 @@ SHOW(__bch_cached_dev)
 	var_printf(bypass_torture_test,	"%i");
 	var_printf(writeback_metadata,	"%i");
 	var_printf(writeback_running,	"%i");
+	var_printf(allow_stale_data_on_failure, "%i");
 	var_print(writeback_delay);
 	var_print(writeback_percent);
 	sysfs_hprint(writeback_rate,	dc->writeback_rate.rate << 9);
@@ -201,6 +203,7 @@ STORE(__cached_dev)
 #define d_strtoi_h(var)		sysfs_hatoi(var, dc->var)
 
 	sysfs_strtoul(data_csum,	dc->disk.data_csum);
+	d_strtoul(allow_stale_data_on_failure);
 	d_strtoul(verify);
 	d_strtoul(bypass_torture_test);
 	d_strtoul(writeback_metadata);
@@ -335,6 +338,7 @@ static struct attribute *bch_cached_dev_files[] = {
 	&sysfs_verify,
 	&sysfs_bypass_torture_test,
 #endif
+	&sysfs_allow_stale_data_on_failure,
 	NULL
 };
 KTYPE(bch_cached_dev);