diff mbox

bcache: only recovery I/O error for writethrough mode

Message ID 20170710111828.97918-1-colyli@suse.de (mailing list archive)
State New, archived
Headers show

Commit Message

Coly Li July 10, 2017, 11:18 a.m. UTC
If a read bio to cache device gets failed, bcache will try to recovery it
by forward the read bio to backing device. If backing device responses
read request successfully then the bio contains data from backing device
will be returned to uppper layer.

The recovery effort in cached_dev_read_error() is not correct, and there
is report that corrupted data may returned when a dirty cache device goes
offline during reading I/O.

For writeback cache mode, before dirty data are wrote back to backing
device, data blocks on backing device are not updated and consistent. If
a dirty cache device dropped and a read bio gets failed, bcache will
return its stale version from backing device. This is mistaken behavior
that applications don't expected, especially for data base workload.

This patch fixes the issue by only permit recoverable I/O when cached
device is in writethough mode, and s->recoverable is set. For other cache
mode, recovery I/O failure by reading backing device does not make sense,
bache just simply returns -EIO immediately.

Reported-by: Arne Wolf <awolf@lenovo.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/request.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Coly Li July 11, 2017, 3:55 a.m. UTC | #1
On 2017/7/11 上午5:46, Kai Krakow wrote:
> Am Mon, 10 Jul 2017 19:18:28 +0800
> schrieb Coly Li <colyli@suse.de>:
> 
>> If a read bio to cache device gets failed, bcache will try to
>> recovery it by forward the read bio to backing device. If backing
>> device responses read request successfully then the bio contains data
>> from backing device will be returned to uppper layer.
>>
>> The recovery effort in cached_dev_read_error() is not correct, and
>> there is report that corrupted data may returned when a dirty cache
>> device goes offline during reading I/O.
>>
>> For writeback cache mode, before dirty data are wrote back to backing
>> device, data blocks on backing device are not updated and consistent.
>> If a dirty cache device dropped and a read bio gets failed, bcache
>> will return its stale version from backing device. This is mistaken
>> behavior that applications don't expected, especially for data base
>> workload.
>>
>> This patch fixes the issue by only permit recoverable I/O when cached
>> device is in writethough mode, and s->recoverable is set. For other
>> cache mode, recovery I/O failure by reading backing device does not
>> make sense, bache just simply returns -EIO immediately.
> 
> Shouldn't write-around mode be okay for the same reason, i.e. there's
> no stale version on disk?
> 
> So the patch should read "mode != CACHE_MODE_WRITEBACK" (not sure about
> the constant), that would also match your textual description.
> 

For write-around mode, there is only backing device as bdev target, it
does not have such inconsistent data issue.

And indeed, this is a try-best effort. I mean, if a cached device is
changed from writeback mode to writethrough mode, then the cache device
is offline with dirty data, data corruption will still be probably to
happen. Anyway, this patch fixes most of cases when people do not change
cache mode in runtime.

Thanks.

Coly

> 
>> Reported-by: Arne Wolf <awolf@lenovo.com>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> ---
>>  drivers/md/bcache/request.c | 5 ++++-
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
>> index 019b3df9f1c6..6edacac9b00d 100644
>> --- a/drivers/md/bcache/request.c
>> +++ b/drivers/md/bcache/request.c
>> @@ -702,8 +702,11 @@ static void cached_dev_read_error(struct closure
>> *cl) {
>>  	struct search *s = container_of(cl, struct search, cl);
>>  	struct bio *bio = &s->bio.bio;
>> +	struct cached_dev *dc = container_of(s->d, struct
>> cached_dev, disk);
>> +	unsigned mode = cache_mode(dc, NULL);
>>  
>> -	if (s->recoverable) {
>> +	if (s->recoverable &&
>> +	    (mode == CACHE_MODE_WRITETHROUGH)) {
>>  		/* Retry from the backing device: */
>>  		trace_bcache_read_retry(s->orig_bio);
>>  
> 
> 
>
Eric Wheeler July 11, 2017, 5:42 p.m. UTC | #2
On Tue, 11 Jul 2017, Coly Li wrote:
> On 2017/7/11 上午5:46, Kai Krakow wrote:
> > Am Mon, 10 Jul 2017 19:18:28 +0800
> > schrieb Coly Li <colyli@suse.de>:
> > 
> >> If a read bio to cache device gets failed, bcache will try to
> >> recovery it by forward the read bio to backing device. If backing
> >> device responses read request successfully then the bio contains data
> >> from backing device will be returned to uppper layer.
> >>
> >> The recovery effort in cached_dev_read_error() is not correct, and
> >> there is report that corrupted data may returned when a dirty cache
> >> device goes offline during reading I/O.
> >>
> >> For writeback cache mode, before dirty data are wrote back to backing
> >> device, data blocks on backing device are not updated and consistent.
> >> If a dirty cache device dropped and a read bio gets failed, bcache
> >> will return its stale version from backing device. This is mistaken
> >> behavior that applications don't expected, especially for data base
> >> workload.
> >>
> >> This patch fixes the issue by only permit recoverable I/O when cached
> >> device is in writethough mode, and s->recoverable is set. For other
> >> cache mode, recovery I/O failure by reading backing device does not
> >> make sense, bache just simply returns -EIO immediately.

Since bcache keeps the dirty map in memory, can you allow the passthrough 
to the backing device if the block is clean, but -EIO otherwise?


--
Eric Wheeler



> > 
> > Shouldn't write-around mode be okay for the same reason, i.e. there's
> > no stale version on disk?
> > 
> > So the patch should read "mode != CACHE_MODE_WRITEBACK" (not sure about
> > the constant), that would also match your textual description.
> > 
> 
> For write-around mode, there is only backing device as bdev target, it
> does not have such inconsistent data issue.
> 
> And indeed, this is a try-best effort. I mean, if a cached device is
> changed from writeback mode to writethrough mode, then the cache device
> is offline with dirty data, data corruption will still be probably to
> happen. Anyway, this patch fixes most of cases when people do not change
> cache mode in runtime.
> 
> Thanks.
> 
> Coly
> 
> > 
> >> Reported-by: Arne Wolf <awolf@lenovo.com>
> >> Signed-off-by: Coly Li <colyli@suse.de>
> >> ---
> >>  drivers/md/bcache/request.c | 5 ++++-
> >>  1 file changed, 4 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> >> index 019b3df9f1c6..6edacac9b00d 100644
> >> --- a/drivers/md/bcache/request.c
> >> +++ b/drivers/md/bcache/request.c
> >> @@ -702,8 +702,11 @@ static void cached_dev_read_error(struct closure
> >> *cl) {
> >>  	struct search *s = container_of(cl, struct search, cl);
> >>  	struct bio *bio = &s->bio.bio;
> >> +	struct cached_dev *dc = container_of(s->d, struct
> >> cached_dev, disk);
> >> +	unsigned mode = cache_mode(dc, NULL);
> >>  
> >> -	if (s->recoverable) {
> >> +	if (s->recoverable &&
> >> +	    (mode == CACHE_MODE_WRITETHROUGH)) {
> >>  		/* Retry from the backing device: */
> >>  		trace_bcache_read_retry(s->orig_bio);
> >>  
> > 
> > 
> > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
Coly Li July 12, 2017, 1:32 a.m. UTC | #3
On 2017/7/12 上午1:42, Eric Wheeler wrote:
> On Tue, 11 Jul 2017, Coly Li wrote:
>> On 2017/7/11 上午5:46, Kai Krakow wrote:
>>> Am Mon, 10 Jul 2017 19:18:28 +0800
>>> schrieb Coly Li <colyli@suse.de>:
>>>
>>>> If a read bio to cache device gets failed, bcache will try to
>>>> recovery it by forward the read bio to backing device. If backing
>>>> device responses read request successfully then the bio contains data
>>>> from backing device will be returned to uppper layer.
>>>>
>>>> The recovery effort in cached_dev_read_error() is not correct, and
>>>> there is report that corrupted data may returned when a dirty cache
>>>> device goes offline during reading I/O.
>>>>
>>>> For writeback cache mode, before dirty data are wrote back to backing
>>>> device, data blocks on backing device are not updated and consistent.
>>>> If a dirty cache device dropped and a read bio gets failed, bcache
>>>> will return its stale version from backing device. This is mistaken
>>>> behavior that applications don't expected, especially for data base
>>>> workload.
>>>>
>>>> This patch fixes the issue by only permit recoverable I/O when cached
>>>> device is in writethough mode, and s->recoverable is set. For other
>>>> cache mode, recovery I/O failure by reading backing device does not
>>>> make sense, bache just simply returns -EIO immediately.
> 
> Since bcache keeps the dirty map in memory, can you allow the passthrough 
> to the backing device if the block is clean, but -EIO otherwise?

Hi Eric,

This is a nice idea, I also discussed with other people around me. A
reason not to choose it is, for data base applications in clustering
environment, it is not very important for dirty data lost on
disconnected from one server, because the work load can be switched to
other server and replay inside the cluster. But it is important to
detect the disconnected cache device failure as early as possible, so
the work load can be switched with less dirty data lost.

Finally I choose simply returning an -EIO to I/O requester, to make
application to know the cache not available earlier and not to introduce
more inconsistent data on backing device.

Coly

> 
>>>
>>> Shouldn't write-around mode be okay for the same reason, i.e. there's
>>> no stale version on disk?
>>>
>>> So the patch should read "mode != CACHE_MODE_WRITEBACK" (not sure about
>>> the constant), that would also match your textual description.
>>>
>>
>> For write-around mode, there is only backing device as bdev target, it
>> does not have such inconsistent data issue.
>>
>> And indeed, this is a try-best effort. I mean, if a cached device is
>> changed from writeback mode to writethrough mode, then the cache device
>> is offline with dirty data, data corruption will still be probably to
>> happen. Anyway, this patch fixes most of cases when people do not change
>> cache mode in runtime.
>>
>> Thanks.
>>
>> Coly
>>
>>>
>>>> Reported-by: Arne Wolf <awolf@lenovo.com>
>>>> Signed-off-by: Coly Li <colyli@suse.de>
>>>> ---
>>>>  drivers/md/bcache/request.c | 5 ++++-
>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
>>>> index 019b3df9f1c6..6edacac9b00d 100644
>>>> --- a/drivers/md/bcache/request.c
>>>> +++ b/drivers/md/bcache/request.c
>>>> @@ -702,8 +702,11 @@ static void cached_dev_read_error(struct closure
>>>> *cl) {
>>>>  	struct search *s = container_of(cl, struct search, cl);
>>>>  	struct bio *bio = &s->bio.bio;
>>>> +	struct cached_dev *dc = container_of(s->d, struct
>>>> cached_dev, disk);
>>>> +	unsigned mode = cache_mode(dc, NULL);
>>>>  
>>>> -	if (s->recoverable) {
>>>> +	if (s->recoverable &&
>>>> +	    (mode == CACHE_MODE_WRITETHROUGH)) {
>>>>  		/* Retry from the backing device: */
>>>>  		trace_bcache_read_retry(s->orig_bio);
>>>>  
>>>
>>>
>>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Coly Li July 12, 2017, 1:37 a.m. UTC | #4
On 2017/7/12 上午8:34, tang.junhui@zte.com.cn wrote:
>> If a read bio to cache device gets failed, bcache will try to recovery it
>> by forward the read bio to backing device. If backing device responses
>> read request successfully then the bio contains data from backing device
>> will be returned to uppper layer.
>>
>> The recovery effort in cached_dev_read_error() is not correct, and there
>> is report that corrupted data may returned when a dirty cache device goes
>> offline during reading I/O.
>>
>> For writeback cache mode, before dirty data are wrote back to backing
>> device, data blocks on backing device are not updated and consistent. If
>> a dirty cache device dropped and a read bio gets failed, bcache will
>> return its stale version from backing device. This is mistaken behavior
>> that applications don't expected, especially for data base workload.
>>
>> This patch fixes the issue by only permit recoverable I/O when cached
>> device is in writethough mode, and s->recoverable is set. For other cache
>> mode, recovery I/O failure by reading backing device does not make sense,
>> bache just simply returns -EIO immediately.
>>
>> Reported-by: Arne Wolf <awolf@lenovo.com>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> ---
>>  drivers/md/bcache/request.c | 5 ++++-
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
>> index 019b3df9f1c6..6edacac9b00d 100644
>> --- a/drivers/md/bcache/request.c
>> +++ b/drivers/md/bcache/request.c
>> @@ -702,8 +702,11 @@ static void cached_dev_read_error(struct closure
> *cl)
>>  {
>>                   struct search *s = container_of(cl, struct search, cl);
>>                   struct bio *bio = &s->bio.bio;
>> +                 struct cached_dev *dc = container_of(s->d, struct
> cached_dev, disk);
>> +                 unsigned mode = cache_mode(dc, NULL);
>>  
>> -                 if (s->recoverable) {
>> +                 if (s->recoverable &&
>> +                     (mode == CACHE_MODE_WRITETHROUGH)) {
>>                                    /* Retry from the backing device: */
>>                                    trace_bcache_read_retry(s->orig_bio);
>>  
>> --
> 
> No, Poly,
> Cache mode can change dynamically,
> So maybe the bcache device is still dirty
> when the cache mode is changed to CACHE_MODE_WRITETHROUGH,
> so I think this patch can not solve this question fundamentally.

Hi Junnhui,

Yes I agree with you, if cache mode is switched from writeback to
writethrough or writearound mode, and then the cache device is
disconnected, this patch does not help us.

When I replied question from Kai Krakow, I expressed similar opinion as
yours. This patch is a best-effort fix, it is very necessary for data
base applications which always use writeback mode and not switch to
other mode during all their online time.

Thanks.

Coly
Coly Li July 12, 2017, 1:45 a.m. UTC | #5
On 2017/7/12 上午9:43, tang.junhui@zte.com.cn wrote:
>>>> If a read bio to cache device gets failed, bcache will try to
> recovery it
>>>> by forward the read bio to backing device. If backing device responses
>>>> read request successfully then the bio contains data from backing
> device
>>>> will be returned to uppper layer.
>>>>
>>>> The recovery effort in cached_dev_read_error() is not correct, and
> there
>>>> is report that corrupted data may returned when a dirty cache device
> goes
>>>> offline during reading I/O.
>>>>
>>>> For writeback cache mode, before dirty data are wrote back to backing
>>>> device, data blocks on backing device are not updated and
> consistent. If
>>>> a dirty cache device dropped and a read bio gets failed, bcache will
>>>> return its stale version from backing device. This is mistaken behavior
>>>> that applications don't expected, especially for data base workload.
>>>>
>>>> This patch fixes the issue by only permit recoverable I/O when cached
>>>> device is in writethough mode, and s->recoverable is set. For other
> cache
>>>> mode, recovery I/O failure by reading backing device does not make
> sense,
>>>> bache just simply returns -EIO immediately.
>>>>
>>>> Reported-by: Arne Wolf <awolf@lenovo.com>
>>>> Signed-off-by: Coly Li <colyli@suse.de>
>>>> ---
>>>>  drivers/md/bcache/request.c | 5 ++++-
>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
>>>> index 019b3df9f1c6..6edacac9b00d 100644
>>>> --- a/drivers/md/bcache/request.c
>>>> +++ b/drivers/md/bcache/request.c
>>>> @@ -702,8 +702,11 @@ static void cached_dev_read_error(struct closure
>>> *cl)
>>>>  {
>>>>                   struct search *s = container_of(cl, struct search,
> cl);
>>>>                   struct bio *bio = &s->bio.bio;
>>>> +                 struct cached_dev *dc = container_of(s->d, struct
>>> cached_dev, disk);
>>>> +                 unsigned mode = cache_mode(dc, NULL);
>>>>  
>>>> -                 if (s->recoverable) {
>>>> +                 if (s->recoverable &&
>>>> +                     (mode == CACHE_MODE_WRITETHROUGH)) {
>>>>                                    /* Retry from the backing device: */
>>>>                                  
>  trace_bcache_read_retry(s->orig_bio);
>>>>  
>>>> --
>>>
>>> No, Poly,
>>> Cache mode can change dynamically,
>>> So maybe the bcache device is still dirty
>>> when the cache mode is changed to CACHE_MODE_WRITETHROUGH,
>>> so I think this patch can not solve this question fundamentally.
>>
>>Hi Junnhui,
>>
>>Yes I agree with you, if cache mode is switched from writeback to
>>writethrough or writearound mode, and then the cache device is
>>disconnected, this patch does not help us.
>>
>>When I replied question from Kai Krakow, I expressed similar opinion as
>>yours. This patch is a best-effort fix, it is very necessary for data
>>base applications which always use writeback mode and not switch to
>>other mode during all their online time.
>  
> Since this issue is important, we should find out a solution
> to solve it completely, not just do best-effort.
> 

I meant "it is very necessary for data base applications which always
use *writeback* mode and not switch to other mode during all their
online time."  ^_^

Thanks.

Coly
Coly Li July 12, 2017, 3:20 a.m. UTC | #6
On 2017/7/12 上午10:01, tang.junhui@zte.com.cn wrote:
>>I meant "it is very necessary for data base applications which always
>>use *writeback* mode and not switch to other mode during all their
>>online time."  ^_^
> 
> I know, it is necessary, but not enough. who can promise they will not
> switch during online time? This patch is logical imperfectly.

Yes, I agree with you. Since Eric mentions dirty data map, an improved
fix shows up in my head,

When cache device disconnected from system,
0) If in non mode, do nothing.
1) If in writeback/writethough/writearound mode, and dirty map is clean,
   - switch to non mode
   - continue to handle I/O without cache device
2) If in writeback mode, and dirty map is not clean,
   - return -EIO immediately for WRITE request
   - return -EIO immediately for READ request (*)
3) If not in writeback mode, and dirty map is not clean. It means the
cache mode is switched from writeback mode with dirty data lost, then
   - returns -EIO immediately for WRITE request
   - returns -EIO immediately for READ request (*)

(*) NOTE:
A sysfs entry "recovery_io_error" can be add here, which is disabled as
default. If it is enabled, if a READ request does not hit dirty map,
bcache will provide it from backing device.

By the above method, if no dirty data lost and cache device
disconnected, bcache steps back to non mode. If we have dirty data lost,
always returns -EIO to notify application to handle the failure as soon
as possible. And there will be a little more failure tolerance if
"recovery_io_error" is enabled.

Eric, Junhui,

How do you think about the above idea ? Thanks in advance for your comments.

Coly
Eric Wheeler July 13, 2017, 12:53 a.m. UTC | #7
On Wed, 12 Jul 2017, Coly Li wrote:

> On 2017/7/12 上午10:01, tang.junhui@zte.com.cn wrote:
> >>I meant "it is very necessary for data base applications which always
> >>use *writeback* mode and not switch to other mode during all their
> >>online time."  ^_^
> > 
> > I know, it is necessary, but not enough. who can promise they will not
> > switch during online time? This patch is logical imperfectly.
> 
> Yes, I agree with you. Since Eric mentions dirty data map, an improved
> fix shows up in my head,
> 
> When cache device disconnected from system,
> 0) If in non mode, do nothing.

Does non mode guarantee that nothing is dirty?  I'm not convinced of that.  
I think you can set non mode with dirty blocks. (Correct me if I'm wrong 
here.)

> 1) If in writeback/writethough/writearound mode, and dirty map is clean,
>    - switch to non mode
>    - continue to handle I/O without cache device

Sure, that makes sense.  

> 2) If in writeback mode, and dirty map is not clean,

You would want to do a dirty map lookup for each IO.  How about this:

2) If in _any_ mode, and dirty map is dirty for *that specific block*:

  If WRITE request completely overlaps the dirty segment then clear 
	the dirty flag and pass through to the backing dev.
  otherwise:
    - return -EIO immediately for WRITE request
    - return -EIO immediately for READ request (*)

If the WRITE request completely overlaps the dirty segment as indicated 
from the in-memory metadata, then clear its dirty flag and write to the 
backing device.  Whatever was dirty isn't important anymore as it was 
overwritten.

Unless there is a good reason to diverge, we would want this recovery 
logic would be the same for failed IOs from an existing cachedev (eg, with 
badblocks), and for cachedevs that are altogether missing.


> 3) If not in writeback mode, and dirty map is not clean. It means the
> cache mode is switched from writeback mode with dirty data lost, then
>    - returns -EIO immediately for WRITE request
>    - returns -EIO immediately for READ request (*)

For #2,3, do a dirty map lookup for every IO: if the block is clean, pass 
it to the backing device.  Only -EIO if the request cannot be recovered 
(block is dirty) and invoke pr_crit_once() to notify the user.  We want 
all IO requests to succeed to the extent possible.

I think #3 is the same case as #2.  The logic is the same whether its is 
now or ever was in writeback mode, regardless of the current mode.
 
> (*) NOTE:
> A sysfs entry "recovery_io_error" can be add here, which is disabled as
> default. If it is enabled, if a READ request does not hit dirty map,
> bcache will provide it from backing device.

Resilience first!  This should default on.  

Sometimes systems run for months with bad sectors, and this case is no 
different.  Let the bcache users' IOs succeed if possible but notify them 
with pr_crit_once().

A reboot might loose data.  They could be lucky with important data in the 
page cache; notifying them without killing the device because it is dirty 
might give them a chance to do a hot backup before rebooting (or 
stop/starting bcache).

Since the dirty map is still in memory, that information is 
useful for recovery.  After a reboot the dirty map is lost---and with it 
the data about what is consistent and what is not. 

For example, if LVM snapshots sit atop of the bcache volume, then you 
could `dd` them off.  If you hit an error, you know that copy is at least 
partially inconsistent and can try an older snapshot until one is found 
which is old enough to be 100% consistent.  Without the dirty map, you 
would only be guessing at which volume is actually consistent.

Let users set recovery_io_error=0 for those who really want to fail early.

--
Eric Wheeler
Coly Li July 13, 2017, 6:47 a.m. UTC | #8
What we are discussing is more then the original patch, now the topic
changes to how handling cache device disconnection more properly. So I
change email thread subject.

On 2017/7/13 上午8:53, Eric Wheeler wrote:
> On Wed, 12 Jul 2017, Coly Li wrote:
> 
>> On 2017/7/12 上午10:01, tang.junhui@zte.com.cn wrote:
>>>> I meant "it is very necessary for data base applications which always
>>>> use *writeback* mode and not switch to other mode during all their
>>>> online time."  ^_^
>>>
>>> I know, it is necessary, but not enough. who can promise they will not
>>> switch during online time? This patch is logical imperfectly.
>>
>> Yes, I agree with you. Since Eric mentions dirty data map, an improved
>> fix shows up in my head,
>>
>> When cache device disconnected from system,
>> 0) If in non mode, do nothing.
> 
> Does non mode guarantee that nothing is dirty?  I'm not convinced of that.  
> I think you can set non mode with dirty blocks. (Correct me if I'm wrong 
> here.)
> 

I think you are correct. Your question notices me, that it is still
possible that user switches cache mode more then once as they want,
maybe some sequence like this,
 writeback -> writethrough -> none -> writeback -> none ....
So we should always check whether dirty map exists or clean, no matter
what current cache mode is.

Nice hint :-)


>> 1) If in writeback/writethough/writearound mode, and dirty map is clean,
>>    - switch to non mode
>>    - continue to handle I/O without cache device
> 
> Sure, that makes sense.  
> 
>> 2) If in writeback mode, and dirty map is not clean,
> 
> You would want to do a dirty map lookup for each IO.  How about this:
> 
> 2) If in _any_ mode, and dirty map is dirty for *that specific block*:
> 
>   If WRITE request completely overlaps the dirty segment then clear 
> 	the dirty flag and pass through to the backing dev.
>   otherwise:
>     - return -EIO immediately for WRITE request
>     - return -EIO immediately for READ request (*)
> 
> If the WRITE request completely overlaps the dirty segment as indicated 
> from the in-memory metadata, then clear its dirty flag and write to the 
> backing device.  Whatever was dirty isn't important anymore as it was 
> overwritten.

What I worried here is, the lost dirty data blocks is probably to have
inter-dependency, e.g. file system metadata (maybe database transaction
records).

If the cache device disconnected and a single overlapped dirty block is
permitted to go into backing device. It may cause a more worse metadata
corruption and data lose on backing device.

The root cause is, once a metadata in memory is written into bcache, its
page cache is not diry any more and probably to be released. Then cache
device disconnected and an overlapped metadata write happens, there is
no way to keep or recovery data consistent. The best method in my brain
is: don't touch it.


> 
> Unless there is a good reason to diverge, we would want this recovery 
> logic would be the same for failed IOs from an existing cachedev (eg, with 
> badblocks), and for cachedevs that are altogether missing.
> 

For clean data lost, this is totally correct. For dirty data lost, it
might not be always correct. At least for writeback mode, this recovery
logic is buggy. Return a corrupted/stale data in silence is disaster,
this logic should be fixed.


> 
>> 3) If not in writeback mode, and dirty map is not clean. It means the
>> cache mode is switched from writeback mode with dirty data lost, then
>>    - returns -EIO immediately for WRITE request
>>    - returns -EIO immediately for READ request (*)
> 
> For #2,3, do a dirty map lookup for every IO: if the block is clean, pass 
> it to the backing device.  Only -EIO if the request cannot be recovered 
> (block is dirty) and invoke pr_crit_once() to notify the user.  We want 
> all IO requests to succeed to the extent possible.

For SSD, especially industry level NVMe SSD, there are quite a lot
redundant capacity inside. If an internal storage unit failed, SSD
controller will recovery the broken unit by map its LBA to another
internal storage unit. Which means if we encounter consistent bad block
on NVMe SSD, this is an important warning, because this device has no
internal space to remap and will die very soon.

5 years ago, I know some PCIe SSDs had 30%~50% capacity more internally
for better write performance (more space to avoid synchronized garbage
collection). If a bad block returned from such SSD, the situation is
similar to a hard disk reports 25%~30% sectors are bad. Definitely
people should replace the SSD as soon as possible.


> I think #3 is the same case as #2.  The logic is the same whether its is 
> now or ever was in writeback mode, regardless of the current mode.
> 
>> (*) NOTE:
>> A sysfs entry "recovery_io_error" can be add here, which is disabled as
>> default. If it is enabled, if a READ request does not hit dirty map,
>> bcache will provide it from backing device.
> 
> Resilience first!  This should default on.

For any storage system, data health is always first priority. Providing
corrupted data is unacceptable.

If bcache cannot recovery a correct data, returning a stale data by
recovery is insane from view of users, we should not make it happen.

But if people use bcache to accelerate applications on laptop with cheap
SSD, they care about performance much more then data consistency. And
this kind of use case should represent most of bcache development in the
world. Hmm, I cannot find a way to refute you... you are right, this
should default on.

How about this,
we make this configure default on, and explain in code and document why
it exists and in which condition it should be disabled. Then people may
disable it when they seriously cares about data consistency.

Do you think it is reasonable ?

> 
> Sometimes systems run for months with bad sectors, and this case is no 
> different.  Let the bcache users' IOs succeed if possible but notify them 
> with pr_crit_once().
> 

Sure, if we set recovery_io_error default on, its behavior should be
expect like this.



> A reboot might loose data.  They could be lucky with important data in the 
> page cache; notifying them without killing the device because it is dirty 
> might give them a chance to do a hot backup before rebooting (or 
> stop/starting bcache).
> 

After a page cache is synced to underlying storage media, it is set to
clean, and can be reclaim any time by memory manage code. Then we don't
have in-memory copy to recovery.

The situation is, the data block is considered clean in memory, clean in
backing device, and dirty in cache device. If cache device disconnected,
we only know a specific dirty data lost, and no cheap way to recovery it
from memory.

This is why I proposed to return -EIO immediately if recovery_io_error=1.

> Since the dirty map is still in memory, that information is 
> useful for recovery.  After a reboot the dirty map is lost---and with it 
> the data about what is consistent and what is not. 
> 
> For example, if LVM snapshots sit atop of the bcache volume, then you 
> could `dd` them off.  If you hit an error, you know that copy is at least 
> partially inconsistent and can try an older snapshot until one is found 
> which is old enough to be 100% consistent.  Without the dirty map, you 
> would only be guessing at which volume is actually consistent.
>
> Let users set recovery_io_error=0 for those who really want to fail early.

Yes, your suggestion is great, let's set recovery_io_error=1 as default.

From the above discussion, if:
- recovery_io_error=1 is set as default.
- dirty map should still be check in "non" mode (this is any mode that
you suggested), and handled properly.
do you think there is any thing more to fix on my proposal ?

Thanks for all the hints!

Coly
Eric Wheeler July 24, 2017, 7:19 p.m. UTC | #9
On Thu, 13 Jul 2017, Coly Li wrote:

> What we are discussing is more then the original patch, now the topic
> changes to how handling cache device disconnection more properly. So I
> change email thread subject.
> 
> On 2017/7/13 上午8:53, Eric Wheeler wrote:
> > On Wed, 12 Jul 2017, Coly Li wrote:
> > 
> >> On 2017/7/12 上午10:01, tang.junhui@zte.com.cn wrote:
> >>>> I meant "it is very necessary for data base applications which always
> >>>> use *writeback* mode and not switch to other mode during all their
> >>>> online time."  ^_^
> >>>
> >>> I know, it is necessary, but not enough. who can promise they will not
> >>> switch during online time? This patch is logical imperfectly.
> >>
> >> Yes, I agree with you. Since Eric mentions dirty data map, an improved
> >> fix shows up in my head,
> >>
> >> When cache device disconnected from system,
> >> 0) If in non mode, do nothing.
> > 
> > Does non mode guarantee that nothing is dirty?  I'm not convinced of that.  
> > I think you can set non mode with dirty blocks. (Correct me if I'm wrong 
> > here.)
> > 
> 
> I think you are correct. Your question notices me, that it is still
> possible that user switches cache mode more then once as they want,
> maybe some sequence like this,
>  writeback -> writethrough -> none -> writeback -> none ....
> So we should always check whether dirty map exists or clean, no matter
> what current cache mode is.
> 
> Nice hint :-)
> 
> 
> >> 1) If in writeback/writethough/writearound mode, and dirty map is clean,
> >>    - switch to non mode
> >>    - continue to handle I/O without cache device
> > 
> > Sure, that makes sense.  
> > 
> >> 2) If in writeback mode, and dirty map is not clean,
> > 
> > You would want to do a dirty map lookup for each IO.  How about this:
> > 
> > 2) If in _any_ mode, and dirty map is dirty for *that specific block*:
> > 
> >   If WRITE request completely overlaps the dirty segment then clear 
> > 	the dirty flag and pass through to the backing dev.
> >   otherwise:
> >     - return -EIO immediately for WRITE request
> >     - return -EIO immediately for READ request (*)
> > 
> > If the WRITE request completely overlaps the dirty segment as indicated 
> > from the in-memory metadata, then clear its dirty flag and write to the 
> > backing device.  Whatever was dirty isn't important anymore as it was 
> > overwritten.
> 
> What I worried here is, the lost dirty data blocks is probably to have
> inter-dependency, e.g. file system metadata (maybe database transaction
> records).
> 
> If the cache device disconnected and a single overlapped dirty block is
> permitted to go into backing device. It may cause a more worse metadata
> corruption and data lose on backing device.

I think we might be using the term "overlapped" in two different ways: I 
think you mean overlap as a request which only partially overwrites the 
data which is dirty.  My meaning for overlap was that it completely 
overlaps the dirty data, that is, the whole block specified by the WRITE 
request is already dirty at the time of submission and exactly matches 
what the dirty map indicates such that clearing the dirty map for the 
request does not clear the dirty map for any dirty data that is not 
overwritten by the request.

I agree that WRITE requests which do not fully overwrite the dirty block 
must -EIO.

We can only write to the backing device if the WRITE request being made is 
to an offset+length that is completely dirty, in which case the related 
cache block in-memory dirty flag can be cleared.  It must completely match 
the dirty block size so the write complete replaces the dirty area.  

Does this seem correct?  If not, please suggest an example to illustrate.

> > 
> > Unless there is a good reason to diverge, we would want this recovery 
> > logic would be the same for failed IOs from an existing cachedev (eg, with 
> > badblocks), and for cachedevs that are altogether missing.
> > 
> 
> For clean data lost, this is totally correct. For dirty data lost, it
> might not be always correct. At least for writeback mode, this recovery
> logic is buggy. Return a corrupted/stale data in silence is disaster,
> this logic should be fixed.

I think we are saying the same thing.  Always return valid data or -EIO.  
I'm just suggesting that the -EIO path from the cache device should be the 
same logic whether it is because the driver returned -EIO or because the 
cache device is missing.

> > 
> >> 3) If not in writeback mode, and dirty map is not clean. It means the
> >> cache mode is switched from writeback mode with dirty data lost, then
> >>    - returns -EIO immediately for WRITE request
> >>    - returns -EIO immediately for READ request (*)
> > 
> > For #2,3, do a dirty map lookup for every IO: if the block is clean, pass 
> > it to the backing device.  Only -EIO if the request cannot be recovered 
> > (block is dirty) and invoke pr_crit_once() to notify the user.  We want 
> > all IO requests to succeed to the extent possible.
> 
> For SSD, especially industry level NVMe SSD, there are quite a lot
> redundant capacity inside. If an internal storage unit failed, SSD
> controller will recovery the broken unit by map its LBA to another
> internal storage unit. Which means if we encounter consistent bad block
> on NVMe SSD, this is an important warning, because this device has no
> internal space to remap and will die very soon.

Interesting, true!

What does bcache do currently if the cache device still exists (is not 
missing) but is getting -EIO's from the driver?
 
> 5 years ago, I know some PCIe SSDs had 30%~50% capacity more internally
> for better write performance (more space to avoid synchronized garbage
> collection). If a bad block returned from such SSD, the situation is
> similar to a hard disk reports 25%~30% sectors are bad. Definitely
> people should replace the SSD as soon as possible.

Indeed!

> > I think #3 is the same case as #2.  The logic is the same whether its is 
> > now or ever was in writeback mode, regardless of the current mode.
> > 
> >> (*) NOTE:
> >> A sysfs entry "recovery_io_error" can be add here, which is disabled as
> >> default. If it is enabled, if a READ request does not hit dirty map,
> >> bcache will provide it from backing device.
> > 
> > Resilience first!  This should default on.
> 
> For any storage system, data health is always first priority. Providing
> corrupted data is unacceptable.
> 
> If bcache cannot recovery a correct data, returning a stale data by
> recovery is insane from view of users, we should not make it happen.

Agreed.
 
> But if people use bcache to accelerate applications on laptop with cheap
> SSD, they care about performance much more then data consistency. And
> this kind of use case should represent most of bcache development in the
> world. Hmm, I cannot find a way to refute you... you are right, this
> should default on.
> 
> How about this,
> we make this configure default on, and explain in code and document why
> it exists and in which condition it should be disabled. Then people may
> disable it when they seriously cares about data consistency.
> 
> Do you think it is reasonable ?

Definitely!

> [cut pagecache bits]

(Nevermind what I said about the pagecache, it only confuses things.  
Lets stick to the block layer :) )

> > Since the dirty map is still in memory, that information is 
> > useful for recovery.  After a reboot the dirty map is lost---and with it 
> > the data about what is consistent and what is not. 
> > 
> > For example, if LVM snapshots sit atop of the bcache volume, then you 
> > could `dd` them off.  If you hit an error, you know that copy is at least 
> > partially inconsistent and can try an older snapshot until one is found 
> > which is old enough to be 100% consistent.  Without the dirty map, you 
> > would only be guessing at which volume is actually consistent.
> >
> > Let users set recovery_io_error=0 for those who really want to fail early.
> 
> Yes, your suggestion is great, let's set recovery_io_error=1 as default.
> 
> From the above discussion, if:
> - recovery_io_error=1 is set as default.
> - dirty map should still be check in "non" mode (this is any mode that
> you suggested), and handled properly.
> do you think there is any thing more to fix on my proposal ?

I think you proposal is great, thank you for all of the work on this!

-Eric

> 
> Thanks for all the hints!
> 
> Coly
> 
>
Coly Li July 26, 2017, 5:08 p.m. UTC | #10
On 07/25/2017 03:19 AM, Eric Wheeler wrote:
> On Thu, 13 Jul 2017, Coly Li wrote:
> 
>> What we are discussing is more then the original patch, now the topic
>> changes to how handling cache device disconnection more properly. So I
>> change email thread subject.
>>
>> On 2017/7/13 上午8:53, Eric Wheeler wrote:
>>> On Wed, 12 Jul 2017, Coly Li wrote:
>>>
>>>> On 2017/7/12 上午10:01, tang.junhui@zte.com.cn wrote:
>>>>>> I meant "it is very necessary for data base applications which always
>>>>>> use *writeback* mode and not switch to other mode during all their
>>>>>> online time."  ^_^
>>>>>
>>>>> I know, it is necessary, but not enough. who can promise they will not
>>>>> switch during online time? This patch is logical imperfectly.
>>>>
>>>> Yes, I agree with you. Since Eric mentions dirty data map, an improved
>>>> fix shows up in my head,
>>>>
>>>> When cache device disconnected from system,
>>>> 0) If in non mode, do nothing.
>>>
>>> Does non mode guarantee that nothing is dirty?  I'm not convinced of that.
>>> I think you can set non mode with dirty blocks. (Correct me if I'm wrong
>>> here.)
>>>
>>
>> I think you are correct. Your question notices me, that it is still
>> possible that user switches cache mode more then once as they want,
>> maybe some sequence like this,
>>   writeback -> writethrough -> none -> writeback -> none ....
>> So we should always check whether dirty map exists or clean, no matter
>> what current cache mode is.
>>
>> Nice hint :-)
>>
>>
>>>> 1) If in writeback/writethough/writearound mode, and dirty map is clean,
>>>>     - switch to non mode
>>>>     - continue to handle I/O without cache device
>>>
>>> Sure, that makes sense.
>>>
>>>> 2) If in writeback mode, and dirty map is not clean,
>>>
>>> You would want to do a dirty map lookup for each IO.  How about this:
>>>
>>> 2) If in _any_ mode, and dirty map is dirty for *that specific block*:
>>>
>>>    If WRITE request completely overlaps the dirty segment then clear
>>> 	the dirty flag and pass through to the backing dev.
>>>    otherwise:
>>>      - return -EIO immediately for WRITE request
>>>      - return -EIO immediately for READ request (*)
>>>
>>> If the WRITE request completely overlaps the dirty segment as indicated
>>> from the in-memory metadata, then clear its dirty flag and write to the
>>> backing device.  Whatever was dirty isn't important anymore as it was
>>> overwritten.
>>
>> What I worried here is, the lost dirty data blocks is probably to have
>> inter-dependency, e.g. file system metadata (maybe database transaction
>> records).
>>
>> If the cache device disconnected and a single overlapped dirty block is
>> permitted to go into backing device. It may cause a more worse metadata
>> corruption and data lose on backing device.
> 
> I think we might be using the term "overlapped" in two different ways: I
> think you mean overlap as a request which only partially overwrites the
> data which is dirty.  My meaning for overlap was that it completely
> overlaps the dirty data, that is, the whole block specified by the WRITE
> request is already dirty at the time of submission and exactly matches
> what the dirty map indicates such that clearing the dirty map for the
> request does not clear the dirty map for any dirty data that is not
> overwritten by the request.
> 
> I agree that WRITE requests which do not fully overwrite the dirty block
> must -EIO.

I understand what you meant on "overlap", my reply was not detailed 
enough, let me explain my idea again.

I mentioned, "the lost dirty data blocks is probably to have
inter-dependency, e.g. file system metadata". That is, even a failed 
request exactly overlaps with a dirty key range, permitting it goes into 
disk platter may not be correct. Because if the dirty data is file 
system metadata (indeed bcache tends to keep metadata in cache), and 
cache device disconnects from system. In this case, file system on 
cached device may be corrupted already, permit one or more dirty blocks 
hitting disk platter does not help, maybe make things worse.

Imaging cached device is formatted as Ext4 file system, and it is 
attached to cache set. During an offline fsck running on the bcache 
device, and cache device disconnected. In this case, almost all modified 
metadata blocks are on disconnected cache device. At this moment, dirty 
data (on cache device) on top of cached device is a consistent file 
system; only cached device may contain a consistent or recoverable file 
system. But if fsck continue to write meta blocks and bcache permits 
them hitting disk platter (because of exact key range overlap), then the 
file system on cached device will be in a undefined inter medium status. 
Which may mislead further fsck operations and make a worse file system 
corruption.

So when a cache device disconnects from system, unless we have following 
dirty data exactly overlaps all dirty key ranges, otherwise we should 
not permit it (them) hitting disk platter. It is possible that a write 
request just exactly overlaps existing dirty key ranges, e.g. only one 
dirty key and it exactly matches LBA range of the write request.

> 
> We can only write to the backing device if the WRITE request being made is
> to an offset+length that is completely dirty, in which case the related
> cache block in-memory dirty flag can be cleared.  It must completely match
> the dirty block size so the write complete replaces the dirty area.
> 
> Does this seem correct?  If not, please suggest an example to illustrate.
> 

I provide a fsck example in above lines. Indeed when there is dependency 
within on-disk data structure, e.g. a tree structure, on disk heartbeat 
map, permitting an exactly overlapped write request to overwrite cached 
disk is very probably problematic.


>>>
>>> Unless there is a good reason to diverge, we would want this recovery
>>> logic would be the same for failed IOs from an existing cachedev (eg, with
>>> badblocks), and for cachedevs that are altogether missing.
>>>
>>
>> For clean data lost, this is totally correct. For dirty data lost, it
>> might not be always correct. At least for writeback mode, this recovery
>> logic is buggy. Return a corrupted/stale data in silence is disaster,
>> this logic should be fixed.
> 
> I think we are saying the same thing.  Always return valid data or -EIO.
> I'm just suggesting that the -EIO path from the cache device should be the
> same logic whether it is because the driver returned -EIO or because the
> cache device is missing.
> 

Yes, they are same code path. Bcache does not handle cache device 
failure now, I/O failure or cache device failure are all non-zero 
bi_status in bi_end_io call back. Yes, we think in same way here.


>>>
>>>> 3) If not in writeback mode, and dirty map is not clean. It means the
>>>> cache mode is switched from writeback mode with dirty data lost, then
>>>>     - returns -EIO immediately for WRITE request
>>>>     - returns -EIO immediately for READ request (*)
>>>
>>> For #2,3, do a dirty map lookup for every IO: if the block is clean, pass
>>> it to the backing device.  Only -EIO if the request cannot be recovered
>>> (block is dirty) and invoke pr_crit_once() to notify the user.  We want
>>> all IO requests to succeed to the extent possible.
>>
>> For SSD, especially industry level NVMe SSD, there are quite a lot
>> redundant capacity inside. If an internal storage unit failed, SSD
>> controller will recovery the broken unit by map its LBA to another
>> internal storage unit. Which means if we encounter consistent bad block
>> on NVMe SSD, this is an important warning, because this device has no
>> internal space to remap and will die very soon.
> 
> Interesting, true!
> 
> What does bcache do currently if the cache device still exists (is not
> missing) but is getting -EIO's from the driver?
>   

Currently they are same code path, bcache code only receives non-zero 
error from bi_status, we cannot know it is from a failed cache device, 
or just an I/O error returned from device.


>> 5 years ago, I know some PCIe SSDs had 30%~50% capacity more internally
>> for better write performance (more space to avoid synchronized garbage
>> collection). If a bad block returned from such SSD, the situation is
>> similar to a hard disk reports 25%~30% sectors are bad. Definitely
>> people should replace the SSD as soon as possible.
> 
> Indeed!
> 
>>> I think #3 is the same case as #2.  The logic is the same whether its is
>>> now or ever was in writeback mode, regardless of the current mode.
>>>
>>>> (*) NOTE:
>>>> A sysfs entry "recovery_io_error" can be add here, which is disabled as
>>>> default. If it is enabled, if a READ request does not hit dirty map,
>>>> bcache will provide it from backing device.
>>>
>>> Resilience first!  This should default on.
>>
>> For any storage system, data health is always first priority. Providing
>> corrupted data is unacceptable.
>>
>> If bcache cannot recovery a correct data, returning a stale data by
>> recovery is insane from view of users, we should not make it happen.
> 
> Agreed.
>   
>> But if people use bcache to accelerate applications on laptop with cheap
>> SSD, they care about performance much more then data consistency. And
>> this kind of use case should represent most of bcache development in the
>> world. Hmm, I cannot find a way to refute you... you are right, this
>> should default on.
>>
>> How about this,
>> we make this configure default on, and explain in code and document why
>> it exists and in which condition it should be disabled. Then people may
>> disable it when they seriously cares about data consistency.
>>
>> Do you think it is reasonable ?
> 
> Definitely!
> 
>> [cut pagecache bits]
> 
> (Nevermind what I said about the pagecache, it only confuses things.
> Lets stick to the block layer :) )
> 
>>> Since the dirty map is still in memory, that information is
>>> useful for recovery.  After a reboot the dirty map is lost---and with it
>>> the data about what is consistent and what is not.
>>>
>>> For example, if LVM snapshots sit atop of the bcache volume, then you
>>> could `dd` them off.  If you hit an error, you know that copy is at least
>>> partially inconsistent and can try an older snapshot until one is found
>>> which is old enough to be 100% consistent.  Without the dirty map, you
>>> would only be guessing at which volume is actually consistent.
>>>
>>> Let users set recovery_io_error=0 for those who really want to fail early.
>>
>> Yes, your suggestion is great, let's set recovery_io_error=1 as default.
>>
>>  From the above discussion, if:
>> - recovery_io_error=1 is set as default.
>> - dirty map should still be check in "non" mode (this is any mode that
>> you suggested), and handled properly.
>> do you think there is any thing more to fix on my proposal ?
> 
> I think you proposal is great, thank you for all of the work on this!

Thanks for your review on this proposal. Now I start to compose another 
patch for you all to review.
Eric Wheeler July 26, 2017, 8:08 p.m. UTC | #11
On Thu, 27 Jul 2017, Coly Li wrote:
> On 07/25/2017 03:19 AM, Eric Wheeler wrote:
> > On Thu, 13 Jul 2017, Coly Li wrote:
> > 
> > > What we are discussing is more then the original patch, now the topic
> > > changes to how handling cache device disconnection more properly. So I
> > > change email thread subject.
> > >
> > > On 2017/7/13 上午8:53, Eric Wheeler wrote:
> > > > On Wed, 12 Jul 2017, Coly Li wrote:
> > > >
> > > > > On 2017/7/12 上午10:01, tang.junhui@zte.com.cn wrote:
> > > > > > > I meant "it is very necessary for data base applications which
> > > > > > > always
> > > > > > > use *writeback* mode and not switch to other mode during all their
> > > > > > > online time."  ^_^
> > > > > >
> > > > > > I know, it is necessary, but not enough. who can promise they will
> > > > > > not
> > > > > > switch during online time? This patch is logical imperfectly.
> > > > >
> > > > > Yes, I agree with you. Since Eric mentions dirty data map, an improved
> > > > > fix shows up in my head,
> > > > >
> > > > > When cache device disconnected from system,
> > > > > 0) If in non mode, do nothing.
> > > >
> > > > Does non mode guarantee that nothing is dirty?  I'm not convinced of
> > > > that.
> > > > I think you can set non mode with dirty blocks. (Correct me if I'm wrong
> > > > here.)
> > > >
> > >
> > > I think you are correct. Your question notices me, that it is still
> > > possible that user switches cache mode more then once as they want,
> > > maybe some sequence like this,
> > >   writeback -> writethrough -> none -> writeback -> none ....
> > > So we should always check whether dirty map exists or clean, no matter
> > > what current cache mode is.
> > >
> > > Nice hint :-)
> > >
> > >
> > > > > 1) If in writeback/writethough/writearound mode, and dirty map is
> > > > > clean,
> > > > >     - switch to non mode
> > > > >     - continue to handle I/O without cache device
> > > >
> > > > Sure, that makes sense.
> > > >
> > > > > 2) If in writeback mode, and dirty map is not clean,
> > > >
> > > > You would want to do a dirty map lookup for each IO.  How about this:
> > > >
> > > > 2) If in _any_ mode, and dirty map is dirty for *that specific block*:
> > > >
> > > >    If WRITE request completely overlaps the dirty segment then clear
> > > >    	the dirty flag and pass through to the backing dev.
> > > >    otherwise:
> > > >      - return -EIO immediately for WRITE request
> > > >      - return -EIO immediately for READ request (*)
> > > >
> > > > If the WRITE request completely overlaps the dirty segment as indicated
> > > > from the in-memory metadata, then clear its dirty flag and write to the
> > > > backing device.  Whatever was dirty isn't important anymore as it was
> > > > overwritten.
> > >
> > > What I worried here is, the lost dirty data blocks is probably to have
> > > inter-dependency, e.g. file system metadata (maybe database transaction
> > > records).
> > >
> > > If the cache device disconnected and a single overlapped dirty block is
> > > permitted to go into backing device. It may cause a more worse metadata
> > > corruption and data lose on backing device.
> > 
> > I think we might be using the term "overlapped" in two different ways: I
> > think you mean overlap as a request which only partially overwrites the
> > data which is dirty.  My meaning for overlap was that it completely
> > overlaps the dirty data, that is, the whole block specified by the WRITE
> > request is already dirty at the time of submission and exactly matches
> > what the dirty map indicates such that clearing the dirty map for the
> > request does not clear the dirty map for any dirty data that is not
> > overwritten by the request.
> > 
> > I agree that WRITE requests which do not fully overwrite the dirty block
> > must -EIO.
> 
> I understand what you meant on "overlap", my reply was not detailed enough,
> let me explain my idea again.
> 
> I mentioned, "the lost dirty data blocks is probably to have
> inter-dependency, e.g. file system metadata". That is, even a failed request
> exactly overlaps with a dirty key range, permitting it goes into disk platter
> may not be correct. Because if the dirty data is file system metadata (indeed
> bcache tends to keep metadata in cache), and cache device disconnects from
> system. In this case, file system on cached device may be corrupted already,
> permit one or more dirty blocks hitting disk platter does not help, maybe make
> things worse.
> 
> Imaging cached device is formatted as Ext4 file system, and it is attached to
> cache set. During an offline fsck running on the bcache device, and cache
> device disconnected. In this case, almost all modified metadata blocks are on
> disconnected cache device. At this moment, dirty data (on cache device) on top
> of cached device is a consistent file system; only cached device may contain a
> consistent or recoverable file system. But if fsck continue to write meta
> blocks and bcache permits them hitting disk platter (because of exact key
> range overlap), then the file system on cached device will be in a undefined
> inter medium status. Which may mislead further fsck operations and make a
> worse file system corruption.
> 
> So when a cache device disconnects from system, unless we have following dirty
> data exactly overlaps all dirty key ranges, otherwise we should not permit it
> (them) hitting disk platter. It is possible that a write request just exactly
> overlaps existing dirty key ranges, e.g. only one dirty key and it exactly
> matches LBA range of the write request.

That makes sense, I understand you now.  

My premise was that the backing device is seriously corrupt with 
gigabytes of dirty data, so might as well let it write if it can and 
continue to operate hot until it dies completely.  In either case, they 
had better have backups!

If one expects to attempt recovery from the writeback-cached 
backing-device, then your argument is reasonable IIF the backing device is 
even fsckable from the such that it can recover data.

--
Eric Wheeler


> 
> > 
> > We can only write to the backing device if the WRITE request being made is
> > to an offset+length that is completely dirty, in which case the related
> > cache block in-memory dirty flag can be cleared.  It must completely match
> > the dirty block size so the write complete replaces the dirty area.
> > 
> > Does this seem correct?  If not, please suggest an example to illustrate.
> > 
> 
> I provide a fsck example in above lines. Indeed when there is dependency
> within on-disk data structure, e.g. a tree structure, on disk heartbeat map,
> permitting an exactly overlapped write request to overwrite cached disk is
> very probably problematic.
> 
> 
> > > >
> > > > Unless there is a good reason to diverge, we would want this recovery
> > > > logic would be the same for failed IOs from an existing cachedev (eg,
> > > > with
> > > > badblocks), and for cachedevs that are altogether missing.
> > > >
> > >
> > > For clean data lost, this is totally correct. For dirty data lost, it
> > > might not be always correct. At least for writeback mode, this recovery
> > > logic is buggy. Return a corrupted/stale data in silence is disaster,
> > > this logic should be fixed.
> > 
> > I think we are saying the same thing.  Always return valid data or -EIO.
> > I'm just suggesting that the -EIO path from the cache device should be the
> > same logic whether it is because the driver returned -EIO or because the
> > cache device is missing.
> > 
> 
> Yes, they are same code path. Bcache does not handle cache device failure now,
> I/O failure or cache device failure are all non-zero bi_status in bi_end_io
> call back. Yes, we think in same way here.
> 
> 
> > > >
> > > > > 3) If not in writeback mode, and dirty map is not clean. It means the
> > > > > cache mode is switched from writeback mode with dirty data lost, then
> > > > >     - returns -EIO immediately for WRITE request
> > > > >     - returns -EIO immediately for READ request (*)
> > > >
> > > > For #2,3, do a dirty map lookup for every IO: if the block is clean,
> > > > pass
> > > > it to the backing device.  Only -EIO if the request cannot be recovered
> > > > (block is dirty) and invoke pr_crit_once() to notify the user.  We want
> > > > all IO requests to succeed to the extent possible.
> > >
> > > For SSD, especially industry level NVMe SSD, there are quite a lot
> > > redundant capacity inside. If an internal storage unit failed, SSD
> > > controller will recovery the broken unit by map its LBA to another
> > > internal storage unit. Which means if we encounter consistent bad block
> > > on NVMe SSD, this is an important warning, because this device has no
> > > internal space to remap and will die very soon.
> > 
> > Interesting, true!
> > 
> > What does bcache do currently if the cache device still exists (is not
> > missing) but is getting -EIO's from the driver?
> >   
> 
> Currently they are same code path, bcache code only receives non-zero error
> from bi_status, we cannot know it is from a failed cache device, or just an
> I/O error returned from device.
> 
> 
> > > 5 years ago, I know some PCIe SSDs had 30%~50% capacity more internally
> > > for better write performance (more space to avoid synchronized garbage
> > > collection). If a bad block returned from such SSD, the situation is
> > > similar to a hard disk reports 25%~30% sectors are bad. Definitely
> > > people should replace the SSD as soon as possible.
> > 
> > Indeed!
> > 
> > > > I think #3 is the same case as #2.  The logic is the same whether its is
> > > > now or ever was in writeback mode, regardless of the current mode.
> > > >
> > > > > (*) NOTE:
> > > > > A sysfs entry "recovery_io_error" can be add here, which is disabled
> > > > > as
> > > > > default. If it is enabled, if a READ request does not hit dirty map,
> > > > > bcache will provide it from backing device.
> > > >
> > > > Resilience first!  This should default on.
> > >
> > > For any storage system, data health is always first priority. Providing
> > > corrupted data is unacceptable.
> > >
> > > If bcache cannot recovery a correct data, returning a stale data by
> > > recovery is insane from view of users, we should not make it happen.
> > 
> > Agreed.
> >   
> > > But if people use bcache to accelerate applications on laptop with cheap
> > > SSD, they care about performance much more then data consistency. And
> > > this kind of use case should represent most of bcache development in the
> > > world. Hmm, I cannot find a way to refute you... you are right, this
> > > should default on.
> > >
> > > How about this,
> > > we make this configure default on, and explain in code and document why
> > > it exists and in which condition it should be disabled. Then people may
> > > disable it when they seriously cares about data consistency.
> > >
> > > Do you think it is reasonable ?
> > 
> > Definitely!
> > 
> > > [cut pagecache bits]
> > 
> > (Nevermind what I said about the pagecache, it only confuses things.
> > Lets stick to the block layer :) )
> > 
> > > > Since the dirty map is still in memory, that information is
> > > > useful for recovery.  After a reboot the dirty map is lost---and with it
> > > > the data about what is consistent and what is not.
> > > >
> > > > For example, if LVM snapshots sit atop of the bcache volume, then you
> > > > could `dd` them off.  If you hit an error, you know that copy is at
> > > > least
> > > > partially inconsistent and can try an older snapshot until one is found
> > > > which is old enough to be 100% consistent.  Without the dirty map, you
> > > > would only be guessing at which volume is actually consistent.
> > > >
> > > > Let users set recovery_io_error=0 for those who really want to fail
> > > > early.
> > >
> > > Yes, your suggestion is great, let's set recovery_io_error=1 as default.
> > >
> > > From the above discussion, if:
> > > - recovery_io_error=1 is set as default.
> > > - dirty map should still be check in "non" mode (this is any mode that
> > > you suggested), and handled properly.
> > > do you think there is any thing more to fix on my proposal ?
> > 
> > I think you proposal is great, thank you for all of the work on this!
> 
> Thanks for your review on this proposal. Now I start to compose another patch
> for you all to review.
> 
> -- 
> Coly Li
>
diff mbox

Patch

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 019b3df9f1c6..6edacac9b00d 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -702,8 +702,11 @@  static void cached_dev_read_error(struct closure *cl)
 {
 	struct search *s = container_of(cl, struct search, cl);
 	struct bio *bio = &s->bio.bio;
+	struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
+	unsigned mode = cache_mode(dc, NULL);
 
-	if (s->recoverable) {
+	if (s->recoverable &&
+	    (mode == CACHE_MODE_WRITETHROUGH)) {
 		/* Retry from the backing device: */
 		trace_bcache_read_retry(s->orig_bio);