Message ID: 20161121085016.7148-1-quwenruo@cn.fujitsu.com (mailing list archive)
State: New, archived
Hi Qu,

I tested this successfully for RAID5 when doing a scrub (i.e. I mounted
the corrupted disks, ran "btrfs scrub start ...", then checked the disks).

However, if I do a "cat mnt/out.txt" (out.txt is the corrupted file):
1) the system detects that the file is corrupted (good :) )
2) the system returns the correct file content (good :) )
3) the data on the platter is still wrong (no good :( )

Enclosed is the script which reproduces the problem. Note that if I
corrupt the data, a line like the following appears twice in dmesg:

[ 3963.763384] BTRFS warning (device loop2): csum failed ino 257 off 0 csum 2280586218 expected csum 3192393815
[ 3963.766927] BTRFS warning (device loop2): csum failed ino 257 off 0 csum 2280586218 expected csum 3192393815

If I corrupt the parity, the system of course neither detects the
corruption nor tries to correct it; but this is the expected behavior.

BR
G.Baroncelli

On 2016-11-21 09:50, Qu Wenruo wrote:
> In the following situation, scrub will calculate a wrong parity and
> overwrite the correct one:
>
> RAID5 full stripe:
>
> Before
> |     Dev 1     |     Dev 2     |     Dev 3     |
> | Data stripe 1 | Data stripe 2 | Parity Stripe |
> --------------------------------------------------- 0
> | 0x0000 (Bad)  |     0xcdcd    |     0x0000    |
> --------------------------------------------------- 4K
> |     0xcdcd    |     0xcdcd    |     0x0000    |
> ...
> |     0xcdcd    |     0xcdcd    |     0x0000    |
> --------------------------------------------------- 64K
>
> After scrubbing dev3 only:
>
> |     Dev 1     |     Dev 2     |     Dev 3     |
> | Data stripe 1 | Data stripe 2 | Parity Stripe |
> --------------------------------------------------- 0
> | 0xcdcd (Good) |     0xcdcd    |  0xcdcd (Bad) |
> --------------------------------------------------- 4K
> |     0xcdcd    |     0xcdcd    |     0x0000    |
> ...
> |     0xcdcd    |     0xcdcd    |     0x0000    |
> --------------------------------------------------- 64K
>
> The call trace of such corruption is as follows:
>
> scrub_bio_end_io_worker() gets called for each extent read out
> |- scrub_block_complete()
>    |- Data extent csum mismatch
>       |- scrub_handle_errored_block()
>          |- scrub_recheck_block()
>             |- scrub_submit_raid56_bio_wait()
>                |- raid56_parity_recover()
>
> Now we have an rbio with correct data stripe 1 recovered.
> Let's call it "good_rbio".
>
> scrub_parity_check_and_repair()
> |- raid56_parity_submit_scrub_rbio()
>    |- lock_stripe_add()
>    |  |- steal_rbio()
>    |     |- Recovered data is stolen from "good_rbio" and stored into
>    |        rbio->stripe_pages[].
>    |        Now rbio->bio_pages[] holds the bad data read from disk.
>    |- async_scrub_parity()
>       |- scrub_parity_work() (delayed call to scrub_parity_work)
>
> scrub_parity_work()
> |- raid56_parity_scrub_stripe()
>    |- validate_rbio_for_parity_scrub()
>       |- finish_parity_scrub()
>          |- Recalculates parity using the *BAD* pages in
>             rbio->bio_pages[], so the good parity is overwritten
>             with a *BAD* one.
>
> The fix is to introduce 2 new members, bad_ondisk_a/b, in struct
> btrfs_raid_bio, to tell the scrub code to use the correct data pages
> to recalculate parity.
>
> Reported-by: Goffredo Baroncelli <kreijack@inwind.it>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
> Thanks to the above hell of delayed function calls and the damn
> convoluted code logic, such a bug is quite hard to trace.
>
> The kernel scrub is already multi-threaded; why make such meaningless
> delayed function calls again and again?
>
> What's wrong with single-threaded scrub?
> We can do things stripe by stripe for raid56, which is easy and
> straightforward; the only delayed thing is waking up the waiter:
>
> lock_full_stripe()
> if (!is_parity_stripe()) {
>         prepare_data_stripe_bios()
>         submit_and_wait_bios()
>         if (check_csum() == 0)
>                 goto out;
> }
> prepare_full_stripe_bios()
> submit_and_wait_bios()
>
> recover_raid56_stripes()
> prepare_full_stripe_write_bios()
> submit_and_wait_bios()
>
> out:
> unlock_full_stripe()
>
> We really need to rework the whole damn scrub code.
>
> Also, we need to enhance btrfs-progs to detect scrub problems (my
> submitted offline scrub is good enough for such usage), plus tools to
> corrupt extents reliably, so this can go into xfstests test cases.
>
> RAID56 scrub code is neither tested nor well designed.
> ---
>  fs/btrfs/raid56.c | 29 ++++++++++++++++++++++++++++-
>  1 file changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> index d016d4a..87e3565 100644
> --- a/fs/btrfs/raid56.c
> +++ b/fs/btrfs/raid56.c
> @@ -133,6 +133,16 @@ struct btrfs_raid_bio {
>  	/* second bad stripe (for raid6 use) */
>  	int failb;
>
> +	/*
> +	 * For steal_rbio, we can steal the recovered correct pages,
> +	 * but in finish_parity_scrub() we would still use the bad
> +	 * on-disk pages to calculate parity.
> +	 * Use these members to tell finish_parity_scrub() to use
> +	 * the correct pages.
> +	 */
> +	int bad_ondisk_a;
> +	int bad_ondisk_b;
> +
>  	int scrubp;
>  	/*
>  	 * number of pages needed to represent the full
> @@ -310,6 +320,12 @@ static void steal_rbio(struct btrfs_raid_bio *src, struct btrfs_raid_bio *dest)
>  	if (!test_bit(RBIO_CACHE_READY_BIT, &src->flags))
>  		return;
>
> +	/* Record recovered stripe number */
> +	if (src->faila != -1)
> +		dest->bad_ondisk_a = src->faila;
> +	if (src->failb != -1)
> +		dest->bad_ondisk_b = src->failb;
> +
>  	for (i = 0; i < dest->nr_pages; i++) {
>  		s = src->stripe_pages[i];
>  		if (!s || !PageUptodate(s)) {
> @@ -998,6 +1014,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
>  	rbio->stripe_npages = stripe_npages;
>  	rbio->faila = -1;
>  	rbio->failb = -1;
> +	rbio->bad_ondisk_a = -1;
> +	rbio->bad_ondisk_b = -1;
>  	atomic_set(&rbio->refs, 1);
>  	atomic_set(&rbio->error, 0);
>  	atomic_set(&rbio->stripes_pending, 0);
> @@ -2352,7 +2370,16 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
>  		void *parity;
>  		/* first collect one page from each data stripe */
>  		for (stripe = 0; stripe < nr_data; stripe++) {
> -			p = page_in_rbio(rbio, stripe, pagenr, 0);
> +
> +			/*
> +			 * Use the stolen recovered pages rather than
> +			 * the bad on-disk pages.
> +			 */
> +			if (stripe == rbio->bad_ondisk_a ||
> +			    stripe == rbio->bad_ondisk_b)
> +				p = rbio_stripe_page(rbio, stripe, pagenr);
> +			else
> +				p = page_in_rbio(rbio, stripe, pagenr, 0);
>  			pointers[stripe] = kmap(p);
>  		}
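[The tables above follow directly from how RAID5 parity is computed:
parity is the byte-wise XOR of the data stripes. Recomputing it with the
stale 0x0000 page gives 0x0000 ^ 0xcdcd = 0xcdcd (the bad parity in the
"after" table) instead of the correct 0xcdcd ^ 0xcdcd = 0x0000. Below is
a self-contained userspace sketch of that recomputation, for
illustration only -- it is not the kernel code.]

#include <stddef.h>
#include <stdint.h>

/*
 * RAID5 parity over nr_data stripes is the byte-wise XOR of all of
 * them, so a single stale input page poisons every parity byte where
 * that page differs from the real data.
 */
static void raid5_xor_parity(uint8_t *parity, uint8_t *const data[],
			     size_t nr_data, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		uint8_t x = 0;

		for (size_t d = 0; d < nr_data; d++)
			x ^= data[d][i];
		parity[i] = x;
	}
}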
At 11/22/2016 02:48 AM, Goffredo Baroncelli wrote:
> Hi Qu,
>
> I tested this successfully for RAID5 when doing a scrub (i.e. I mounted
> the corrupted disks, ran "btrfs scrub start ...", then checked the disks).
>
> However, if I do a "cat mnt/out.txt" (out.txt is the corrupted file):
> 1) the system detects that the file is corrupted (good :) )
> 2) the system returns the correct file content (good :) )
> 3) the data on the platter is still wrong (no good :( )

Do you mean, reading the corrupted data won't repair it?

IIRC that's the designed behavior.

For RAID5/6 read there are several different modes, like READ_REBUILD or
SCRUB_PARITY.

I'm not sure about write, but for read it won't write the corrected data
back.

So it's the designed behavior, if I'm not missing something.

Thanks,
Qu

[...]
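[For reference: the modes Qu mentions are tracked per-rbio as an
operation type. In the 4.x-era fs/btrfs/raid56.h the enum looked roughly
like the sketch below; BTRFS_RBIO_READ_REBUILD is named explicitly later
in this thread, while the other identifiers are reconstructed from
memory and should be treated as approximate.]

enum btrfs_rbio_ops {
	BTRFS_RBIO_WRITE,        /* normal full/partial stripe write */
	BTRFS_RBIO_READ_REBUILD, /* rebuild lost data for a read; no write-back */
	BTRFS_RBIO_PARITY_SCRUB, /* scrub: verify and rewrite parity */
};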
On 2016-11-22 01:28, Qu Wenruo wrote:
>
> At 11/22/2016 02:48 AM, Goffredo Baroncelli wrote:
>> Hi Qu,
>>
>> I tested this successfully for RAID5 when doing a scrub (i.e. I mounted
>> the corrupted disks, ran "btrfs scrub start ...", then checked the disks).
>>
>> However, if I do a "cat mnt/out.txt" (out.txt is the corrupted file):
>> 1) the system detects that the file is corrupted (good :) )
>> 2) the system returns the correct file content (good :) )
>> 3) the data on the platter is still wrong (no good :( )
>
> Do you mean, reading the corrupted data won't repair it?
>
> IIRC that's the designed behavior.

:O

You are right... I was unaware of that....

So you can add a "Tested-by: Goffredo Baroncelli <kreijack@inwind.it>".

BR
G.Baroncelli

> For RAID5/6 read there are several different modes, like READ_REBUILD or
> SCRUB_PARITY.
>
> I'm not sure about write, but for read it won't write the corrected data
> back.
>
> So it's the designed behavior, if I'm not missing something.
>
> Thanks,
> Qu

[...]
On 11/21/2016 03:50 AM, Qu Wenruo wrote:
> In the following situation, scrub will calculate a wrong parity and
> overwrite the correct one:

[...]

> RAID56 scrub code is neither tested nor well designed.

Great description, thanks for tracking this down.

> @@ -2352,7 +2370,16 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
>  		void *parity;
>  		/* first collect one page from each data stripe */
>  		for (stripe = 0; stripe < nr_data; stripe++) {
> -			p = page_in_rbio(rbio, stripe, pagenr, 0);
> +
> +			/*
> +			 * Use the stolen recovered pages rather than
> +			 * the bad on-disk pages.
> +			 */
> +			if (stripe == rbio->bad_ondisk_a ||
> +			    stripe == rbio->bad_ondisk_b)
> +				p = rbio_stripe_page(rbio, stripe, pagenr);
> +			else
> +				p = page_in_rbio(rbio, stripe, pagenr, 0);
>  			pointers[stripe] = kmap(p);
>  		}

We're changing which pages we kmap() but not which ones we kunmap(). Can
you please update the kunmap loop to use this pointers array? Also, it
looks like this kmap is never unmapped:

	pointers[stripe++] = kmap(q_page);

-chris
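[One way to satisfy both of Chris's points is to record which page each
slot actually mapped, so the unmap loop cannot diverge from the map
loop. A minimal sketch -- this is not the patch that was eventually
merged, and "mapped_pages" plus its bound are invented here for
illustration:]

	struct page *mapped_pages[MAX_RBIO_STRIPES];	/* hypothetical bound */
	void *pointers[MAX_RBIO_STRIPES];
	int stripe;

	for (stripe = 0; stripe < nr_data; stripe++) {
		struct page *p;

		if (stripe == rbio->bad_ondisk_a ||
		    stripe == rbio->bad_ondisk_b)
			p = rbio_stripe_page(rbio, stripe, pagenr);
		else
			p = page_in_rbio(rbio, stripe, pagenr, 0);
		mapped_pages[stripe] = p;	/* remember the mapped page */
		pointers[stripe] = kmap(p);
	}

	/* ... recompute and compare parity using pointers[] ... */

	for (stripe = 0; stripe < nr_data; stripe++)
		kunmap(mapped_pages[stripe]);	/* unmap the same pages we mapped */

[The same bookkeeping extends naturally to the RAID6 q_page mapping that
Chris points out is never unmapped.]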
At 11/23/2016 02:58 AM, Chris Mason wrote:
> On 11/21/2016 03:50 AM, Qu Wenruo wrote:
>> In the following situation, scrub will calculate a wrong parity and
>> overwrite the correct one:

[...]

> Great description, thanks for tracking this down.
>> @@ -2352,7 +2370,16 @@ static noinline void finish_parity_scrub(struct
>> btrfs_raid_bio *rbio,

[...]

> We're changing which pages we kmap() but not which ones we kunmap(). Can
> you please update the kunmap loop to use this pointers array? Also, it
> looks like this kmap is never unmapped.

Oh, I forgot that.
I'll update it soon.

This reminds me: is there any kernel debug option to trace such unmapped
pages?

Thanks,
Qu

> 	pointers[stripe++] = kmap(q_page);
>
> -chris
On Tue, Nov 22, 2016 at 07:02:13PM +0100, Goffredo Baroncelli wrote:
> On 2016-11-22 01:28, Qu Wenruo wrote:
>> At 11/22/2016 02:48 AM, Goffredo Baroncelli wrote:
>>> Hi Qu,
>>>
>>> I tested this successfully for RAID5 when doing a scrub (i.e. I mounted
>>> the corrupted disks, ran "btrfs scrub start ...", then checked the disks).
>>>
>>> However, if I do a "cat mnt/out.txt" (out.txt is the corrupted file):
>>> 1) the system detects that the file is corrupted (good :) )
>>> 2) the system returns the correct file content (good :) )
>>> 3) the data on the platter is still wrong (no good :( )
>>
>> Do you mean, reading the corrupted data won't repair it?
>>
>> IIRC that's the designed behavior.
>
> :O
>
> You are right... I was unaware of that....

This is correct.

Ordinary reads shouldn't touch corrupt data; they should only read
around it. Scrubs in read-write mode should write corrected data over
the corrupt data. Read-only scrubs can only report errors without
correcting them.

Rewriting corrupt data outside of scrub (i.e. on every read) is a
bad idea. Consider what happens if a RAM controller gets too hot:
checksums start failing randomly, but the data on disk is still OK.
If we tried to fix the bad data on every read, we'd probably just
trash the filesystem in some cases.

This risk mitigation measure does rely on admins taking a machine in
this state down immediately, and also somehow knowing not to start a
scrub while their RAM is failing... which is kind of an annoying
requirement for the admin.

[...]
On Fri, Nov 25, 2016 at 3:31 PM, Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> This risk mitigation measure does rely on admins taking a machine in
> this state down immediately, and also somehow knowing not to start a
> scrub while their RAM is failing... which is kind of an annoying
> requirement for the admin.

Attempting to detect whether RAM is bad when a scrub starts is both time
consuming and not very reliable, right?
On Fri, Nov 25, 2016 at 03:40:36PM +1100, Gareth Pye wrote:
> On Fri, Nov 25, 2016 at 3:31 PM, Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
>>
>> This risk mitigation measure does rely on admins taking a machine in
>> this state down immediately, and also somehow knowing not to start a
>> scrub while their RAM is failing... which is kind of an annoying
>> requirement for the admin.
>
> Attempting to detect whether RAM is bad when a scrub starts is both time
> consuming and not very reliable, right?

RAM, like all hardware, could fail at any time, and a scrub could
already be running when it happens. This is annoying, but also a fact
of life that admins have to deal with.

Testing RAM before scrub starts is no more beneficial than testing RAM
at random intervals--but if you are testing RAM at random intervals,
why not do it at the same intervals as scrub?

If I see corruption errors showing up in stats, I will do a basic
sanity test to make sure they're coming from the storage layer and not
somewhere closer to the CPU. If all errors come from one device, there
are clear log messages showing SCSI device errors, and the SMART log
matches the other data, then RAM is probably not the root cause of the
failures, so scrub away.

If normally reliable programs like /bin/sh start randomly segfaulting,
there's smoke pouring out of the back of the machine, all the disks are
full of csum failures, and the BIOS welcome message has spelling errors
that weren't there before, I would *not* start a scrub. More like turn
the machine off, take it apart, test all the pieces separately, and
only do a scrub after everything above the storage layer had been
replaced or recertified. I certainly wouldn't want the filesystem to
try to fix the csum failures it finds in such situations.
On 2016-11-25 05:31, Zygo Blaxell wrote:
>>> Do you mean, reading the corrupted data won't repair it?
>>>
>>> IIRC that's the designed behavior.
>>
>> :O
>>
>> You are right... I was unaware of that....
>
> This is correct.
>
> Ordinary reads shouldn't touch corrupt data; they should only read
> around it. Scrubs in read-write mode should write corrected data over
> the corrupt data. Read-only scrubs can only report errors without
> correcting them.
>
> Rewriting corrupt data outside of scrub (i.e. on every read) is a
> bad idea. Consider what happens if a RAM controller gets too hot:
> checksums start failing randomly, but the data on disk is still OK.
> If we tried to fix the bad data on every read, we'd probably just
> trash the filesystem in some cases.

I can't agree. If the filesystem is mounted read-only this behavior may
be correct, but in other cases I don't see any reason not to correct
wrong data even in the read case. If your RAM is unreliable, you have
big problems anyway.

The likelihood that the data contained on a disk is corrupted is higher
than the likelihood that the RAM is bad.

BTW, Btrfs in RAID1 mode corrects the data even in the read case. So I
am still convinced that it is the RAID5/6 behavior that is "strange".

BR
G.Baroncelli
On 11/22/2016 07:26 PM, Qu Wenruo wrote:
>> We're changing which pages we kmap() but not which ones we kunmap(). Can
>> you please update the kunmap loop to use this pointers array? Also, it
>> looks like this kmap is never unmapped.
>
> Oh, I forgot that.
> I'll update it soon.

Thanks!

> This reminds me: is there any kernel debug option to trace such unmapped
> pages?

I don't think so, which is surprising. It explodes so quickly on 32-bit
machines that it's easiest to boot it in a 32-bit qemu.

-chris
On Sat, Nov 26, 2016 at 02:12:56PM +0100, Goffredo Baroncelli wrote:
> On 2016-11-25 05:31, Zygo Blaxell wrote:

[...]

> I can't agree. If the filesystem is mounted read-only this behavior may
> be correct, but in other cases I don't see any reason not to correct
> wrong data even in the read case. If your RAM is unreliable, you have
> big problems anyway.

If you don't like RAM corruption, pick any other failure mode. Laptops
have to deal with things like vibration and temperature extremes, which
produce the same results (spurious csum failures and IO errors under
conditions where writing will only destroy data that would otherwise be
recoverable).

> The likelihood that the data contained on a disk is corrupted is higher
> than the likelihood that the RAM is bad.
>
> BTW, Btrfs in RAID1 mode corrects the data even in the read case.

Have you tested this? I think you'll find that it doesn't.

> I am still convinced that it is the RAID5/6 behavior that is "strange".
On 2016-11-26 19:54, Zygo Blaxell wrote:
> On Sat, Nov 26, 2016 at 02:12:56PM +0100, Goffredo Baroncelli wrote:
>> On 2016-11-25 05:31, Zygo Blaxell wrote:
[...]
>>
>> BTW, Btrfs in RAID1 mode corrects the data even in the read case.
>
> Have you tested this? I think you'll find that it doesn't.

Yes, I tested it, and it does the rebuild automatically.
I corrupted a disk of the mirror, then read the related file. The log says:

[   59.287748] BTRFS warning (device vdb): csum failed ino 257 off 0 csum 12813760 expected csum 3114703128
[   59.291542] BTRFS warning (device vdb): csum failed ino 257 off 0 csum 12813760 expected csum 3114703128
[   59.294950] BTRFS info (device vdb): read error corrected: ino 257 off 0 (dev /dev/vdb sector 2154496)
                                        ^^^^^^^^^^^^^^^^^^^^

IIRC, in the case of RAID5/6 the last line is missing. In both cases the
data returned is good, but in RAID1 the data is also corrected on the
disk.

Where did you read that the data is not rebuilt automatically?

In fact I was surprised that RAID5/6 behaves differently....
On Sun, Nov 27, 2016 at 12:16:34AM +0100, Goffredo Baroncelli wrote:
> On 2016-11-26 19:54, Zygo Blaxell wrote:
>> On Sat, Nov 26, 2016 at 02:12:56PM +0100, Goffredo Baroncelli wrote:
>>> BTW, Btrfs in RAID1 mode corrects the data even in the read case.
>>
>> Have you tested this? I think you'll find that it doesn't.
>
> Yes, I tested it, and it does the rebuild automatically.

[...]

> Where did you read that the data is not rebuilt automatically?

Experience? I have real disk failures all the time. Errors on RAID1
arrays persist until scrubbed.

No, wait... _transid_ errors always persist until scrubbed. csum
failures are rewritten in repair_io_failure. There is a comment earlier
in repair_io_failure noting that rewrite in RAID56 is not supported yet.

> In fact I was surprised that RAID5/6 behaves differently....

The difference is surprising no matter which strategy you believe is
correct. ;)
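[For reference, the guard Zygo mentions: as best I recall, the 4.x-era
repair_io_failure() in fs/btrfs/extent_io.c bails out for RAID56 roughly
like this -- treat the exact code as approximate:]

	/* we can't repair anything in raid56 yet */
	if (btrfs_is_parity_mirror(map_tree, logical, length, mirror_num))
		return 0;

[That is why csum failures on RAID1 get rewritten on read, while RAID5/6
errors persist until a read-write scrub.]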
At 11/27/2016 07:16 AM, Goffredo Baroncelli wrote:
> On 2016-11-26 19:54, Zygo Blaxell wrote:
>> On Sat, Nov 26, 2016 at 02:12:56PM +0100, Goffredo Baroncelli wrote:
>>> BTW, Btrfs in RAID1 mode corrects the data even in the read case.
>>
>> Have you tested this? I think you'll find that it doesn't.
>
> Yes, I tested it, and it does the rebuild automatically.

[...]

> In fact I was surprised that RAID5/6 behaves differently....

Yes, I also tried that and confirmed that RAID1 recovers corrupted data
at *READ* time.

The main difference between RAID1 and RAID56 seems to be the complexity.

For RAID56 we have different read/write behavior: for read, we use the
flag BTRFS_RBIO_READ_REBUILD, which will only rebuild data but not write
it back to disk.
And I'm a little concerned about the race between a read-time fix and a
write.

I assume it's possible to change the behavior to follow RAID1, but I'd
like to do it in the following steps:
1) Fix known RAID56 bugs.
   With the v3 patch and the previous 2 patches, it seems OK now.
2) Full fstests test case, with all possible corruption combinations
   (WIP).
3) Rework the current RAID56 code to a cleaner and more readable state
   (long term).
4) Add support to fix things at read time.

So the behavior change is not something we will see in the short term.

Thanks,
Qu
On Sat, 2016-11-26 at 14:12 +0100, Goffredo Baroncelli wrote:
> I can't agree. If the filesystem is mounted read-only this behavior may
> be correct, but in other cases I don't see any reason not to correct
> wrong data even in the read case. If your RAM is unreliable, you have
> big problems anyway.

I'd agree with that - more or less.

If the memory is broken, then even without repairing (on read), a
filesystem that is written to will likely be further corrupted.

I think for safety it's best to repair as early as possible (and thus
on read, when damage is detected), as further blocks/devices may fail
before a scrub (with repair) is eventually run manually.

However, there may be some workloads under which such auto-repair is
undesirable, as it may cost performance, and safety may be less
important than that.

Thus I think there should be a mount option that lets users control
whether repair should happen on normal reads or not... and this should
IMO be independent of whether the fs was mounted ro or rw.

I'd say the default should go for data safety (i.e. repair as soon as
possible).

Cheers,
Chris.
28.11.2016 06:37, Christoph Anton Mitterer wrote:
> On Sat, 2016-11-26 at 14:12 +0100, Goffredo Baroncelli wrote:

[...]

> Thus I think there should be a mount option that lets users control
> whether repair should happen on normal reads or not... and this should
> IMO be independent of whether the fs was mounted ro or rw.

If you allow any write to the filesystem before resuming from
hibernation, you risk a corrupted filesystem. I strongly believe that
"ro" must be really read-only - having a separate option to control it
would require updating every tool that generates an initramfs, and even
then you cannot avoid using an older initramfs with a newer kernel.

> I'd say the default should go for data safety (i.e. repair as soon as
> possible).

Safety means exactly the opposite for me - do not modify data when
explicitly requested not to. Having a filesystem in a half-read-only
state would be too error prone.
On Mon, 2016-11-28 at 06:53 +0300, Andrei Borzenkov wrote:
> If you allow any write to the filesystem before resuming from
> hibernation, you risk a corrupted filesystem. I strongly believe that
> "ro" must be really read-only

You're aware that "ro" already doesn't mean "no changes to the block
device" on most modern filesystems (including btrfs)?

> - having a separate option to control it would require updating every
> tool that generates an initramfs, and even then you cannot avoid using
> an older initramfs with a newer kernel.

What would the initramfs have to do with it? Isn't the whole repairing
thing completely transparent (unless perhaps you do some low-level
forensics, or access the device directly rather than through the
filesystem driver)?

> Safety means exactly the opposite for me - do not modify data when
> explicitly requested not to.

Depends on what you mean by data...
If you mean "the files as exported in the file hierarchy"... then there
won't be any modifications by repair (after all, it just repairs and
gives back the correct data, or fails).

Cheers,
Chris.
On 2016-11-28 04:37, Christoph Anton Mitterer wrote:
> I think for safety it's best to repair as early as possible (and thus
> on read, when damage is detected), as further blocks/devices may fail
> before a scrub (with repair) is eventually run manually.
>
> However, there may be some workloads under which such auto-repair is
> undesirable, as it may cost performance, and safety may be less
> important than that.

I am assuming that a corruption is a quite rare event. So it could
happen occasionally that a page is corrupted and the system corrects
it; this shouldn't have an impact on the workloads.

BR
G.Baroncelli
On 2016-11-28 01:40, Qu Wenruo wrote:
> At 11/27/2016 07:16 AM, Goffredo Baroncelli wrote:

[...]

> I assume it's possible to change the behavior to follow RAID1, but I'd
> like to do it in the following steps:
> 1) Fix known RAID56 bugs.
>    With the v3 patch and the previous 2 patches, it seems OK now.
> 2) Full fstests test case, with all possible corruption combinations
>    (WIP).
> 3) Rework the current RAID56 code to a cleaner and more readable state
>    (long term).
> 4) Add support to fix things at read time.
>
> So the behavior change is not something we will see in the short term.

+1

I understand that the status of the RAID5/6 code is bad enough that we
need to correct all the more critical bugs first, and then increase the
tests to prevent regressions.

On point 3 I don't know the code well enough to say anything; the code
is very complex. I see point 4 as the least urgent.

Let me make a request: I would like to know your opinion about my email
"RFC: raid with a variable stripe size", which started a little thread.
I am asking this because you now have your hands on this code: is my
suggestion (use different block groups with different stripe sizes to
avoid RMW cycles), or Zygo's one (don't fill a stripe if you don't need
to, to avoid RMW cycles), difficult to implement?

BR
G.Baroncelli
On Mon, 2016-11-28 at 19:32 +0100, Goffredo Baroncelli wrote: > I am assuming that a corruption is a quite rare event. So it could > occasionally happen that a page is corrupted and the system corrects > it. This shouldn't have an impact on the workloads. Probably, but it still makes sense to make it configurable, especially as an option like this would be needed to make a btrfs truly non-writing to the device again... The ro mount option just says that the files/permissions/etc. don't change - not that no internals change - so it would IMO be an abuse if ro were used to disable repairs. And some people may simply want that - even if it's just to rule out the possibility of corruptions on a ro fs in case of bad memory ;) Cheers, Chris.
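P.S.: wiring such a knob up would presumably be trivial once a read-time repair path exists. A rough sketch, assuming a hypothetical NO_AUTO_REPAIR mount option (the option does not exist today; btrfs_test_opt() is the existing macro for testing mount options, though its exact signature has changed between kernel versions):

	/* In the (hypothetical) read-time repair path: the good copy
	 * was already reconstructed in memory; only the write-back
	 * is made optional. */
	if (btrfs_test_opt(fs_info, NO_AUTO_REPAIR))
		return 0;	/* reader still gets the good data */
	queue_repair_writeback(fs_info, logical, len);	/* hypothetical */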
On Mon, 2016-11-28 at 19:45 +0100, Goffredo Baroncelli wrote:
> My understanding is that the RAID5/6 code is in bad enough shape
Just some random thought:
If the code for raid56 is really as bad as it's often claimed (I
haven't read it, to be honest)... could it perhaps make sense to
consider starting it from scratch? And/or merging it with a more
generic approach that allows n-way-parity RAIDs (I think such a patch
was posted here some year(s) ago).
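
For what it's worth, the parity math itself would not be the hard part
of an n-way approach. Here's a standalone user-space illustration (my
own sketch, not taken from any posted patch) of the textbook
Vandermonde construction over GF(2^8): npar = 1 degenerates to raid5's
XOR parity and npar = 2 to raid6's P/Q.

	#include <stdint.h>

	/* Multiply in GF(2^8) with the raid6 polynomial
	 * x^8 + x^4 + x^3 + x^2 + 1 (0x11d). */
	static uint8_t gf_mul(uint8_t a, uint8_t b)
	{
		uint8_t p = 0;

		while (b) {
			if (b & 1)
				p ^= a;
			a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
			b >>= 1;
		}
		return p;
	}

	static uint8_t gf_pow(uint8_t a, unsigned int e)
	{
		uint8_t r = 1;

		while (e--)
			r = gf_mul(r, a);
		return r;
	}

	/*
	 * data[i] = byte of data stripe i at a given offset; fills
	 * parity[0..npar-1].  Parity j uses coefficients (2^j)^i, so
	 * j = 0 is plain XOR (raid5 P) and j = 1 is the raid6 Q
	 * syndrome.
	 */
	static void calc_parity(const uint8_t *data, int ndata,
				uint8_t *parity, int npar)
	{
		for (int j = 0; j < npar; j++) {
			uint8_t coeff = gf_pow(2, j);
			uint8_t acc = 0, c = 1;

			for (int i = 0; i < ndata; i++) {
				acc ^= gf_mul(c, data[i]);
				c = gf_mul(c, coeff);
			}
			parity[j] = acc;
		}
	}

(Caveat: beyond two parities this naive power-of-2 construction is not
guaranteed to remain MDS; a real implementation would use a properly
conditioned coding matrix. The hard part would be the on-disk format,
allocator, and recovery/scrub plumbing, not this loop.)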
Cheers,
Chris.
On 2016-11-28 14:01, Christoph Anton Mitterer wrote: > On Mon, 2016-11-28 at 19:45 +0100, Goffredo Baroncelli wrote: >> My understanding is that the RAID5/6 code is in bad enough shape > Just some random thought: > > If the code for raid56 is really as bad as it's often claimed (I > haven't read it, to be honest)... could it perhaps make sense to > consider starting it from scratch? And/or merging it with a more > generic approach that allows n-way-parity RAIDs (I think such a patch > was posted here some year(s) ago). Such a suggestion for higher-order parity support was made some time ago (at least a year and a half, I believe, probably more). It was stated at the time that it was planned after n-way replication and raid5/6. Personally, I feel that sort of road-map is misguided, since all three are functionally interdependent, as the original proposal suggested, and I'm also of the opinion that the raid5/6 code probably should be redone from scratch (I wasn't the one who wrote it, and can't contribute much more than some attempts at testing as a third party, so I obviously can't make that decision myself). Doing so would likely make it much easier to implement higher-order parity (and arbitrary striping/replication, which is something I'm _very_ interested in). The existing code isn't bad in a stylistic or even traditional coding sense, but it makes a significant number of poor choices on the high-level side of things (all the issues with parity, for example). If we just want working raid5/6, then working on the existing implementation is fine. If we want support for arbitrary combinations of striping/replication/parity (which would be a killer feature, and something BTRFS could actually say nothing else has), then rewriting from scratch is probably easier because of some of the low-level design choices made in the raid5/6 code. Part of the problem, though, is that there are more admins and support-focused people than coders involved. In my case, I very much do not have the expertise to work on kernel code beyond tweaking constants and twiddling default values (I'm technically a programmer, but 90% of the work I do is sysops-type stuff written in a mix of sh, Python, and about 7 different data serialization formats).
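P.S.: to make "arbitrary combinations" concrete, the parameterization itself is tiny - something like the following purely hypothetical, kernel-style descriptor (nothing like this exists in the current on-disk format; existing profiles just fall out as special cases):

	struct raid_profile {
		u16 ncopies;	/* replication: 1 = single, 2 = raid1-like */
		u16 nstripes;	/* data stripes per full stripe */
		u16 nparity;	/* 0 = none, 1 = raid5-like, 2 = raid6-like */
	};

	/* raid10-like: { .ncopies = 2, .nstripes = N, .nparity = 0 } */
	/* raid6-like:  { .ncopies = 1, .nstripes = N, .nparity = 2 } */

The descriptor is the easy 1%; the allocator, balance, and recovery paths that would have to honor every combination are the other 99%, which is why I can't do much beyond suggesting it.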
On Mon, Nov 28, 2016 at 07:32:38PM +0100, Goffredo Baroncelli wrote: > On 2016-11-28 04:37, Christoph Anton Mitterer wrote: > > I think for safety it's best to repair as early as possible (and thus > > on read when damage is detected), as further blocks/devices may fail > > till eventually a scrub (with repair) would be run manually. > > > > However, there may be some workloads under which such auto-repair is > > undesirable as it may cost performance and safety may be less important > > than that. > > I am assuming that a corruption is a quite rare event. So it could > occasionally happen that a page is corrupted and the system corrects > it. This shouldn't have an impact on the workloads. Depends heavily on the specifics of the failure case. If a drive's embedded controller RAM fails, you get corruption on the majority of reads from a single disk, and most writes will be corrupted (even if they were not before). If there's a transient failure due to environmental issues (e.g. short-term high-amplitude vibration or overheating) then writes may pause for mechanical retry loops. If there is bitrot in SSDs (particularly in the address translation tables) it looks like a wall of random noise that only ends when the disk goes offline. You can get combinations of these (e.g. RAM failures caused by transient overheating) where the drive's behavior changes over time. When in doubt, don't write.
On Mon, 2016-11-28 at 16:48 -0500, Zygo Blaxell wrote: > If a drive's > embedded controller RAM fails, you get corruption on the majority of > reads from a single disk, and most writes will be corrupted (even if > they > were not before). Administering a multi-PiB Tier-2 for the LHC Computing Grid with quite a number of disks for nearly 10 years now, I have never stumbled on such a case of breakage so far... Actually most cases are as simple as an HDD failing to work, and this being properly signalled to the controller. > If there's a transient failure due to environmental > issues (e.g. short-term high-amplitude vibration or overheating) then > writes may pause for mechanical retry loops. If there is bitrot in > SSDs > (particularly in the address translation tables) it looks like a wall > of random noise that only ends when the disk goes offline. You can > get > combinations of these (e.g. RAM failures caused by transient > overheating) > where the drive's behavior changes over time. > > When in doubt, don't write. Sorry, but these cases, like any memory issues (be it main memory or the HDD controller), would also kick in on any normal writes. So there's no point in protecting against this on the storage side... Either never write at all... or have good backups for these rare cases. Cheers, Chris.
On Tue, Nov 29, 2016 at 02:52:47AM +0100, Christoph Anton Mitterer wrote: > On Mon, 2016-11-28 at 16:48 -0500, Zygo Blaxell wrote: > > If a drive's > > embedded controller RAM fails, you get corruption on the majority of > > reads from a single disk, and most writes will be corrupted (even if > > they > > were not before). > > Administering a multi-PiB Tier-2 for the LHC Computing Grid with quite > a number of disks for nearly 10 years now, I have never stumbled on > such a case of breakage so far... In data centers you won't see breakages that are common on desktop and laptop drives. Laptops in particular sometimes (often?) go to places that are much less friendly to hardware. All my NAS and enterprise drives in server racks and data centers just wake up one morning stone dead or with a few well-behaved bad sectors, with none of this drama. Boring! > Actually most cases are as simple as an HDD failing to work, and this > being properly signalled to the controller. > > If there's a transient failure due to environmental > > issues (e.g. short-term high-amplitude vibration or overheating) then > > writes may pause for mechanical retry loops. If there is bitrot in > > SSDs > > (particularly in the address translation tables) it looks like a wall > > of random noise that only ends when the disk goes offline. You can > > get > > combinations of these (e.g. RAM failures caused by transient > > overheating) > > where the drive's behavior changes over time. > > > > When in doubt, don't write. > > Sorry, but these cases, like any memory issues (be it main memory or > the HDD controller), would also kick in on any normal writes. Yes, but in a RAID1 context there will be another disk with a good copy (or if main RAM is failing, the entire filesystem will be toast no matter what you do). > So there's no point in protecting against this on the storage side... > > Either never write at all... or have good backups for these rare cases. > > Cheers, > Chris.
On Tue, Nov 29, 2016 at 02:52:47AM +0100, Christoph Anton Mitterer wrote: > On Mon, 2016-11-28 at 16:48 -0500, Zygo Blaxell wrote: > > If a drive's embedded controller RAM fails, you get corruption on the > > majority of reads from a single disk, and most writes will be corrupted > > (even if they were not before). > > Administering a multi-PiB Tier-2 for the LHC Computing Grid with quite > a number of disks for nearly 10 years now, I have never stumbled on > such a case of breakage so far... > > Actually most cases are as simple as an HDD failing to work, and this > being properly signalled to the controller. I administer no real storage at this time, and have only 16 disks (plus a few disk-likes) to my name right now. Yet in a span of ~2 months I've seen three cases of silent data corruption: * a RasPi I used for DNS recursor/DHCP/aiccu started mangling some writes, with no notification that something was amiss. With ext4 being a silentdatalossfs, there was no clue it was a disk (ok, SD) problem at all, making it really "fun" to debug. It happens on multiple SD cards, thus it's the machine that's at fault. * an HDD had some link resets and silent data corruption, diagnosed as a bad SATA cable; the disk has worked fine since (obviously after extensive tests). * an HDD that has link resets and silent data corruption (apparently write-time only(?)), Marduk knows why. It happens with multiple cables and on two machines, putting the blame somewhere on the disk. Thus, the assumption that the controller will be notified about read errors is quite invalid. In the above cases, if recovery was possible, it'd be beneficial to rewrite a good copy of the data. Meow!
On Tue, 2016-11-29 at 08:35 +0100, Adam Borowski wrote: > I administer no real storage at this time, and have only 16 disks > (plus a few > disk-likes) to my name right now. Yet in a span of ~2 months I've seen > three > cases of silent data corruption I didn't mean to say we'd have no silent data corruption. OTOH, we cannot have had many of them, since the storage management software itself has another layer of checksumming. What I meant to say is that we likely never had a scenario like the one described by Zygo, where e.g. broken HDD controller memory would cause corruptions (which would then be bad in the case of auto-RAID-repair-on-read). Because IMO such memory issues would be noticed rather soon. Cheers, Chris.