Message ID | 20170919222433.24336-1-colyli@suse.de (mailing list archive)
---|---
State | New, archived |
Coly--

It's an interesting changeset. I am not positive if it will work in
practice-- the most likely objects to be cached are filesystem metadata.
Won't most filesystems fall apart if some of their data structures revert
back to an earlier point of time?

Mike

On Tue, Sep 19, 2017 at 3:24 PM, Coly Li <colyli@suse.de> wrote:
> When bcache does read I/Os, for example in writeback or writethrough mode,
> if a read request on cache device is failed, bcache will try to recovery
> the request by reading from cached device. If the data on cached device is
> not synced with cache device, then requester will get a stale data.
[snip]
On 2017/9/20 at 8:59 AM, Michael Lyle wrote:
> Coly--
>
> It's an interesting changeset.

Hi Mike,

Yes, it's interesting :-) It fixes a silent database data corruption in
our product kernel. The most dangerous part is that it happens silently
even when an in-data checksum is used; this issue was only detected by an
out-of-data checksum.

> I am not positive if it will work in practice-- the most likely
> objects to be cached are filesystem metadata. Won't most filesystems
> fall apart if some of their data structures revert back to an earlier
> point of time?

For database workloads, most of the data cached on the SSD consists of
data blocks of the database file, which are replayed from the binlog (for
example, mysql). The file system won't complain in such a situation, and
an earlier version means all transaction information since the last
update is lost, in *silence*.

Even if the read request failed on file system metadata, because a stale
copy is finally provided to the kernel file system code, the file system
probably won't complain either. Because,
- The file system reports an error when I/O fails; if stale data from
  recovery is provided to the file system, the file system just uses the
  stale data until a worse failure is detected by file system code.
- If the file system uses a metadata checksum, and the checksum is inside
  the metadata block (which is quite common), then because the stale data
  is also checksum-consistent, the file system won't report an error
  either.

So the data corruption happens at the application level, even while the
file system kernel code still thinks everything is consistent on disk ....

Thanks.

Coly Li

> On Tue, Sep 19, 2017 at 3:24 PM, Coly Li <colyli@suse.de> wrote:
>> When bcache does read I/Os, for example in writeback or writethrough mode,
>> if a read request on cache device is failed, bcache will try to recovery
>> the request by reading from cached device. If the data on cached device is
>> not synced with cache device, then requester will get a stale data.
[snip]
On Wed, Sep 20, 2017 at 3:28 AM, Coly Li <colyli@suse.de> wrote:
> Even the read request failed on file system meta data, because finally a
> stale data will be provided to kernel file system code, it is probably
> file system won't complain as well.

The scary case is when filesystem data that points to other filesystem
data is cached. E.g. the data structures representing what space is free
on disk, or a directory, or a database btree. Some examples:

Free space handling-- if a big file /foo is created, and the active
free-space data structures are in cache (and this is likely, because
actively written places can have their writeback-writes
cancelled/deferred indefinitely)-- and then later the caching disk fails,
an old version of this will be read from disk. Later, an effort to write
a file /bar allocates the space used by /foo, and writes over it.

Directory entry handling-- if /var/spool/foo is an active directory
(associated data structures in cache), and has the directory
/var/spool/foo/bar under it, and then /bar is removed... the backing disk
will still have a reference to bar. If the space for bar is then used for
something else, the kernel may end up reading something very different
from what it expects for a directory later after a cache device failure.

Btrees, etc-- the same thing. If a tree shrinks, old tree entities can
end up pointing to other kinds of data.

I think this change is harmful-- it is not a good idea to automatically,
at runtime, decide to start returning data that violates the guarantees
a block device is supposed to obey about ordering and persistence.

Mike
On 2017/9/20 at 5:40 PM, Michael Lyle wrote:
[snip]
> I think this change is harmful-- it is not a good idea to
> automatically, at runtime, decide to start returning data that
> violates the guarantees a block device is supposed to obey about
> ordering and persistence.

Hi Mike,

I totally agree with you. It is my fault for the misleading commit log;
if you read it again you may find we stand on the same side, which is
what I feel from your response :-)

The current bcache code does provide stale data from read failure
recovery. In the v1 patch discussion people wanted to keep this behavior,
so in the v2 version I added an option to permit this "harmful" behavior,
disabled by default. And good to know Kent does not like an option; then
we can disable this "harmful" behavior by default.

Thanks.

Coly
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index dee542fff68e..f26b174f409a 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -356,6 +356,7 @@ struct cached_dev {
 	unsigned		partial_stripes_expensive:1;
 	unsigned		writeback_metadata:1;
 	unsigned		writeback_running:1;
+	unsigned		allow_stale_data_on_failure:1;
 	unsigned char		writeback_percent;
 	unsigned		writeback_delay;

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 019b3df9f1c6..becbc0959ca2 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -702,8 +702,20 @@ static void cached_dev_read_error(struct closure *cl)
 {
 	struct search *s = container_of(cl, struct search, cl);
 	struct bio *bio = &s->bio.bio;
+	struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
+	int recovery_stale_data = dc ? dc->allow_stale_data_on_failure : 0;

-	if (s->recoverable) {
+	/*
+	 * If dc->has_dirty is non-zero and the recovering data is on cache
+	 * device, then recover from cached device will return a stale data
+	 * to requester. But in some cases people accept stale data to avoid
+	 * a -EIO. So I/O error recovery only happens when,
+	 * - No dirty data on cache device.
+	 * - Cached device is dirty but sysfs allow_stale_data_on_failure is
+	 *   explicitly set (to 1) to accept stale data from recovery.
+	 */
+	if (s->recoverable &&
+	    (!atomic_read(&dc->has_dirty) || recovery_stale_data)) {
 		/* Retry from the backing device: */
 		trace_bcache_read_retry(s->orig_bio);

diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index f90f13616980..8603756005a8 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -106,6 +106,7 @@ rw_attribute(cache_replacement_policy);
 rw_attribute(btree_shrinker_disabled);
 rw_attribute(copy_gc_enabled);
 rw_attribute(size);
+rw_attribute(allow_stale_data_on_failure);

 SHOW(__bch_cached_dev)
 {
@@ -125,6 +126,7 @@ SHOW(__bch_cached_dev)
 	var_printf(bypass_torture_test,	"%i");
 	var_printf(writeback_metadata,	"%i");
 	var_printf(writeback_running,	"%i");
+	var_printf(allow_stale_data_on_failure, "%i");
 	var_print(writeback_delay);
 	var_print(writeback_percent);
 	sysfs_hprint(writeback_rate,	dc->writeback_rate.rate << 9);
@@ -201,6 +203,7 @@ STORE(__cached_dev)
 #define d_strtoi_h(var)	sysfs_hatoi(var, dc->var)

 	sysfs_strtoul(data_csum,	dc->disk.data_csum);
+	d_strtoul(allow_stale_data_on_failure);
 	d_strtoul(verify);
 	d_strtoul(bypass_torture_test);
 	d_strtoul(writeback_metadata);
@@ -335,6 +338,7 @@ static struct attribute *bch_cached_dev_files[] = {
 	&sysfs_verify,
 	&sysfs_bypass_torture_test,
 #endif
+	&sysfs_allow_stale_data_on_failure,
 	NULL
 };
 KTYPE(bch_cached_dev);
When bcache does read I/Os, for example in writeback or writethrough mode,
if a read request on the cache device fails, bcache will try to recover
the request by reading from the cached device. If the data on the cached
device is not synced with the cache device, then the requester will get
stale data.

For a critical storage system like a database, providing stale data from
recovery may result in application-level data corruption, which is
unacceptable. But for some other situations, like a multi-media stream
cache, continuous service may be more important, and it is acceptable to
fetch a chunk of stale data.

This patch tries to solve the above conflict by adding a sysfs option
  /sys/block/bcache<idx>/bcache/allow_stale_data_on_failure
which is cleared (to 0) by default, i.e. disabled. Now people can make
choices for different situations.

With this patch, for a failed read request in writeback or writethrough
mode, recovery of a recoverable read request only happens in one of the
following conditions,
- dc->has_dirty is zero. It means all data on the cache device is synced
  to the cached device, so the recovered data is up-to-date.
- dc->has_dirty is non-zero, and dc->allow_stale_data_on_failure is set
  to 1. It means there is dirty data not synced to the cached device yet,
  but the option allow_stale_data_on_failure is set, so receiving stale
  data is explicitly acceptable to the requester.

For other cache modes in bcache, read requests will never hit
cached_dev_read_error(), so they don't need this patch.

Please note, because the cache mode can be switched arbitrarily at run
time, a writethrough mode might have been switched from a writeback mode.
Therefore checking dc->has_dirty in writethrough mode still makes sense.

Changelog:
v2: rename the sysfs entry to allow_stale_data_on_failure, and fix the
    confusing commit log.
v1: initial patch posted.

Signed-off-by: Coly Li <colyli@suse.de>
Reported-by: Arne Wolf <awolf@lenovo.com>
Cc: Nix <nix@esperi.org.uk>
Cc: Kai Krakow <hurikhan77@gmail.com>
Cc: Eric Wheeler <bcache@lists.ewheeler.net>
Cc: Junhui Tang <tang.junhui@zte.com.cn>
Cc: stable@vger.kernel.org
---
 drivers/md/bcache/bcache.h  |  1 +
 drivers/md/bcache/request.c | 14 +++++++++++++-
 drivers/md/bcache/sysfs.c   |  4 ++++
 3 files changed, 18 insertions(+), 1 deletion(-)