| Message ID | 20171102005405.20420-3-bo.li.liu@oracle.com (mailing list archive) |
|---|---|
| State | New, archived |
On 11/02/2017 08:54 AM, Liu Bo wrote:
> With raid6 profile, btrfs can end up with data corruption by the
> following steps.
>
> Say we have 5 disks that are set up with the raid6 profile,
>
> 1) mount this btrfs
> 2) one disk gets pulled out
> 3) write something to btrfs and sync
> 4) another disk gets pulled out
> 5) write something to btrfs and sync
> 6) umount this btrfs
> 7) bring the two disks back
> 8) reboot
> 9) mount this btrfs
>
> Chances are mount will fail (even with -o degraded) because of failures
> on reading metadata blocks; IOW, our raid6 setup is not able to protect
> against two disk failures in some cases.
>
> So it turns out that there is a bug in raid6's recovery code: if we
> have one of the stripes in the raid6 layout shown here,
>
> | D1(*) | D2(*) | D3    | D4    | D5    |
> -----------------------------------------
> | data1 | data2 | P     | Q     | data0 |
>
> D1 and D2 are the two disks which got pulled out and brought back.
> When mount reads data1 and finds out that data1 doesn't match its crc,
> btrfs goes to recover data1 from the other stripes; what it'll be doing
> is
>
> 1) read data2, parity P, parity Q and data0
>
> 2) as we have a valid parity P and two data stripes, it goes to
>    recover in raid5 style.
>
> (However, since disk D2 was pulled out and data2 on D2 could be stale,

   data2 should end up with a crc error; if not, then raid5 recovery
   works as expected. Otherwise this example is confusing for explaining
   the context of two data stripes missing.

Thanks, Anand

> we still get the wrong data1 from this reconstruction.)
>
> 3) btrfs continues to try to reconstruct data1 from parity Q, data2
>    and data0, and we still get the wrong one for the same reason.
>
> The fix here is to take advantage of the device flag, i.e. 'In_sync':
> all data on a device might be stale if 'In_sync' has not been set.
>
> With this, we can rebuild the correct data from parity P, Q and data0.
>
> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> ---
>  fs/btrfs/raid56.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> index 67262f8..3c0ce61 100644
> --- a/fs/btrfs/raid56.c
> +++ b/fs/btrfs/raid56.c
> @@ -1076,7 +1076,8 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio,
>  	disk_start = stripe->physical + (page_index << PAGE_SHIFT);
>
>  	/* if the device is missing, just fail this stripe */
> -	if (!stripe->dev->bdev)
> +	if (!stripe->dev->bdev ||
> +	    !test_bit(In_sync, &stripe->dev->flags))
>  		return fail_rbio_index(rbio, stripe_nr);
>
>  	/* see if we can add this page onto our existing bio */
On Tue, Nov 07, 2017 at 04:32:55PM +0800, Anand Jain wrote:
> On 11/02/2017 08:54 AM, Liu Bo wrote:
> > With raid6 profile, btrfs can end up with data corruption by the
> > following steps.
> >
> > [...]
> >
> > 2) as we have a valid parity P and two data stripes, it goes to
> >    recover in raid5 style.
> >
> > (However, since disk D2 was pulled out and data2 on D2 could be stale,
>
>    data2 should end up with a crc error; if not, then raid5 recovery
>    works as expected. Otherwise this example is confusing for explaining
>    the context of two data stripes missing.

The assumption you have is not true: when doing reconstruction, we
don't verify the checksum for each data stripe that is read from disk.

Thanks,

-liubo

> Thanks, Anand
On 11/09/2017 03:53 AM, Liu Bo wrote:
> On Tue, Nov 07, 2017 at 04:32:55PM +0800, Anand Jain wrote:
>> On 11/02/2017 08:54 AM, Liu Bo wrote:
>>> With raid6 profile, btrfs can end up with data corruption by the
>>> following steps.
>>>
>>> [...]
>>>
>>> (However, since disk D2 was pulled out and data2 on D2 could be stale,
>>
>>    data2 should end up with a crc error; if not, then raid5 recovery
>>    works as expected. Otherwise this example is confusing for
>>    explaining the context of two data stripes missing.
>
> The assumption you have is not true: when doing reconstruction, we
> don't verify the checksum for each data stripe that is read from disk.

Why? What if data0 (which has the In_sync flag) was wrong?

I am wary of the In_sync approach. IMO we are losing the advantage of
the FS being merged with the volume manager - which is to know exactly
which active stripes were lost, for a quicker reconstruct.

Let's say the RAID6 is 50% full when disk1 is pulled out, 75% full when
disk2 is pulled out, and it then reaches 90% full.

So now, when all the disks are back, two disks will not have the
In_sync flag? Hope I understood this correctly.

Now:
 15% of the data has lost two stripes.
 25% of the data has lost one stripe.
 50% of the data did not lose any stripe at all.

Now for read and reconstruct:
 50% of the data does not need any reconstruction.
 25% of the data needs RAID5-style reconstruction, as it has lost only
     one stripe.
 15% of the data needs RAID6 reconstruction, since it has lost two
     stripes.

Does the In_sync design here help to work like this?

Now for RAID1, let's say disk1 lost the first 10% of the stripes and
disk2 lost the last 10% of the stripes, and these stripes are mutually
exclusive. That means there is no single stripe which has been lost
completely.

There is no code yet where I can see when In_sync will be reset, which
is fine for this example. Assume no reconstruction is done. But now,
when both disks are being mounted, and as In_sync will be based on
dev_item.generation, one of them will get In_sync; to me it looks like
there will be 10% of stripes which will fail even though that is not
practically true in this example.

Thanks, Anand
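To make the percentages in the scenario above concrete, this is the
breakdown they imply, assuming writes are spread evenly over the
timeline and no rebuild happens in between:

  first 50% of the data : written while all 5 disks were present, no
                          stripe member lost
  next 25% (50% -> 75%) : written after disk1 was pulled, one stripe
                          member lost (the one on disk1)
  next 15% (75% -> 90%) : written after disk2 was also pulled, two
                          stripe members lost (on disk1 and disk2)

which gives the 50% / 25% / 15% split used in the argument.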
On 2017-11-09 17:29, Anand Jain wrote:
> On 11/09/2017 03:53 AM, Liu Bo wrote:
>> On Tue, Nov 07, 2017 at 04:32:55PM +0800, Anand Jain wrote:
>>> On 11/02/2017 08:54 AM, Liu Bo wrote:
>>>> [...]
>>>> (However, since disk D2 was pulled out and data2 on D2 could be stale,
>>>
>>>    data2 should end up with a crc error; if not, then raid5 recovery
>>>    works as expected. Otherwise this example is confusing for
>>>    explaining the context of two data stripes missing.
>>
>> The assumption you have is not true: when doing reconstruction, we
>> don't verify the checksum for each data stripe that is read from disk.

I have a question: why not verify the data stripe's csum when doing
reconstruction?

I know it can be nasty for metadata, since until we fully rebuild the
stripe we won't be able to check whether the reconstruction is correct.
(And if the reconstruction failed, we should try to rebuild the block
using other untried methods.)

But for data (with csum), I think it's possible to verify the csum at
reconstruction time, which would allow btrfs to better understand which
repair method should be applied.

So at least for data with csum (the easiest case), reconstruction
should be done as follows for the case of D1 corruption:

| D1(*) | D2(*) | D3    | D4    | D5    |
-----------------------------------------
| data1 | data2 | P     | Q     | data0 |

0) D1 csum mismatch, go to the repair procedure
1) Read out the csums for data0~data2
2) Read out data0~data2, P, Q
3) Verify data0~data2; both data1 and data2 fail the check
4) Repair using P, Q and data0
5) Recheck the csums:
   if data1/data2 match their csums, the repair is done well;
   if data1/data2 mismatch their csums, return EIO.

Thanks,
Qu
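For the data-with-csum case, the procedure outlined above could look
roughly like the sketch below. This is an illustrative sketch only, not
btrfs code: struct stripe_ctx, read_stripe(), csum_ok(),
raid6_rebuild_one() and raid6_rebuild_two() are hypothetical helpers
standing in for the real stripe I/O, checksum-tree lookup and P/Q math.

/*
 * Sketch of the csum-verified repair flow proposed above.
 * Hypothetical helpers; not the actual btrfs raid56 implementation.
 */
static int repair_data_stripe_with_csum(struct stripe_ctx *ctx)
{
	/* 1) + 2) read the csums and all stripes: data0..data2, P, Q */
	read_stripe(ctx, DATA0);
	read_stripe(ctx, DATA1);
	read_stripe(ctx, DATA2);
	read_stripe(ctx, P);
	read_stripe(ctx, Q);

	/* 3) verify the data stripes against their csums */
	bool d1_bad = !csum_ok(ctx, DATA1);
	bool d2_bad = !csum_ok(ctx, DATA2);

	if (!d1_bad && !d2_bad)
		return 0;		/* nothing to repair */

	if (d1_bad && d2_bad)
		/* 4) both copies are stale: rebuild both from P, Q and data0 */
		raid6_rebuild_two(ctx, DATA1, DATA2);
	else
		/* only one bad copy: a raid5-style rebuild from P is enough */
		raid6_rebuild_one(ctx, d1_bad ? DATA1 : DATA2);

	/* 5) recheck the csums; if the rebuilt data still mismatches, give up */
	if (!csum_ok(ctx, DATA1) || !csum_ok(ctx, DATA2))
		return -EIO;

	return 0;
}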
On Thu, Nov 09, 2017 at 05:29:25PM +0800, Anand Jain wrote:
> On 11/09/2017 03:53 AM, Liu Bo wrote:
>> On Tue, Nov 07, 2017 at 04:32:55PM +0800, Anand Jain wrote:
>>> [...]
>>>
>>>    data2 should end up with a crc error; if not, then raid5 recovery
>>>    works as expected. Otherwise this example is confusing for
>>>    explaining the context of two data stripes missing.
>>
>> The assumption you have is not true: when doing reconstruction, we
>> don't verify the checksum for each data stripe that is read from disk.
>
> Why? What if data0 (which has the In_sync flag) was wrong?

(Writing these took me the whole afternoon, hopefully it makes sense...)

Checksum verification happens much later, in readpage's end io, and if
data0 is wrong, then the rebuilt data couldn't match its checksum.

As Qu Wenruo also has the same question, here is my input about why
checksum verification is not done:

a) We may have multiple different raid profiles within one btrfs, and
   the checksum tree also resides somewhere on the disks; for raid1
   (the default) and raid5, two disk failures may end up with data
   loss, and it applies to raid6 as well because of the bug this patch
   is addressing.

b) If this is a metadata stripe, then it's likely that the stale
   content still matches its checksum, as metadata blocks store their
   checksum in the header part; thus checksum verification cannot be
   trusted.

c) If this is a data stripe, then chances are it belongs to
   nocow/nodatasum inodes such that no checksum could be verified. In
   case a checksum is available, yes, finally we can verify it, but
   keep in mind the whole data stripe may consist of several pieces of
   data which are from different inodes, some may be
   nodatacow/nodatasum, some may not.

Given all the trouble it could lead to, I'd like to avoid it.

As a raid6 system, 2 disk failures can always be tolerated by
rebuilding data from other good copies.

Now that we know two things, i.e.
a) data1 fails its checksum verification, and
b) data2 from an out-of-sync disk might be stale,
it's straightforward to rebuild data1 from the other good copies.

> I am wary of the In_sync approach. IMO we are losing the advantage of
> the FS being merged with the volume manager - which is to know exactly
> which active stripes were lost, for a quicker reconstruct.

Thanks for pointing this out; I'd say the filesystem still has some
advantages over a vanilla volume manager while resync-ing out-of-sync
disks.

> Let's say the RAID6 is 50% full when disk1 is pulled out, 75% full when
> disk2 is pulled out, and it then reaches 90% full.
>
> So now, when all the disks are back, two disks will not have the
> In_sync flag? Hope I understood this correctly.
>
> Now:
>  15% of the data has lost two stripes.
>  25% of the data has lost one stripe.
>  50% of the data did not lose any stripe at all.
>
> Now for read and reconstruct:
>  50% of the data does not need any reconstruction.

A 'Yes' for this; actually I should have avoided reading from
out-of-sync disks, but I forgot to do so. So as a side effect, if the
data read out from an out-of-sync disk matches its checksum, then it's
proven to be OK.

>  25% of the data needs RAID5-style reconstruction, as it has lost only
>      one stripe.
>  15% of the data needs RAID6 reconstruction, since it has lost two
>      stripes.

It depends; the stripes on the out-of-sync disks could be
a) all data stripes (as in the example in this patch),
b) one data stripe and the stripe P, or
c) one data stripe and the stripe Q.

With the In_sync flag,
for a) raid6 reconstruction,
for b) raid6 reconstruction,
for c) raid5 reconstruction.

> Does the In_sync design here help to work like this?
>
> Now for RAID1, let's say disk1 lost the first 10% of the stripes and
> disk2 lost the last 10% of the stripes, and these stripes are mutually
> exclusive. That means there is no single stripe which has been lost
> completely.

True, probably we can drop patch 3 to let reads happen on out-of-sync
disks and go through the existing repair routine.

> There is no code yet where I can see when In_sync will be reset, which
> is fine for this example. Assume no reconstruction is done. But now,
> when both disks are being mounted, and as In_sync will be based on
> dev_item.generation, one of them will get In_sync; to me it looks like
> there will be 10% of stripes which will fail even though that is not
> practically true in this example.

Not sure if I understand this paragraph; are you talking about raid1 or
raid6?

Regarding when to reset In_sync, my naive idea is to utilize scrub to
verify the checksum on all content that the filesystem knows about and
do the repair work. If it turns out that everything looks good, then we
know the disk has already been resynced, and it's a good time to set
In_sync.

Thanks,

-liubo
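As a rough illustration of the three cases above, the choice between a
raid5-style and a raid6-style rebuild only depends on whether the Q
stripe is among the two suspect stripes. This is a standalone sketch,
not the btrfs raid56 code; the index parameters and the enum are made
up for illustration.

/*
 * Illustrative sketch only: pick the rebuild style for the three cases
 * listed above, given the indexes of the two out-of-sync stripes.
 */
enum rebuild_style { REBUILD_RAID6, REBUILD_RAID5 };

static enum rebuild_style pick_rebuild_style(int faila, int failb, int q_idx)
{
	/*
	 * Case c) a data stripe plus the Q stripe: P and the remaining
	 * data stripes are enough, i.e. a raid5-style rebuild.
	 */
	if (faila == q_idx || failb == q_idx)
		return REBUILD_RAID5;

	/*
	 * Case a) two data stripes, or case b) a data stripe plus P:
	 * both need the full P/Q (raid6) math.
	 */
	return REBUILD_RAID6;
}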
On 2017-11-10 08:12, Liu Bo wrote:
> On Thu, Nov 09, 2017 at 05:29:25PM +0800, Anand Jain wrote:
>> [...]
>>
>> Why? What if data0 (which has the In_sync flag) was wrong?
>
> Checksum verification happens much later, in readpage's end io, and if
> data0 is wrong, then the rebuilt data couldn't match its checksum.
>
> As Qu Wenruo also has the same question, here is my input about why
> checksum verification is not done:
>
> a) We may have multiple different raid profiles within one btrfs, and
>    the checksum tree also resides somewhere on the disks; for raid1
>    (the default) and raid5, two disk failures may end up with data
>    loss, and it applies to raid6 as well because of the bug this patch
>    is addressing.

However, raid56 is more complex than the other profiles, so it even has
its own raid56.[ch]. Although currently that is mostly for the complex
RMW cycle, I think it can also be enhanced to do csum verification at
repair time.

> b) If this is a metadata stripe, then it's likely that the stale
>    content still matches its checksum, as metadata blocks store their
>    checksum in the header part; thus checksum verification cannot be
>    trusted.

Metadata can be handled later; it's much more complex since (by
default) metadata crosses several pages, which makes it impossible to
verify until we rebuild all the related pages.

BTW, if metadata (a tree block) has stale content, it will not match
the csum. The tree block csum covers the whole tree block, so any stale
data anywhere in the tree block will cause a mismatched csum.

It's hard to do just because we can't verify whether the
rebuilt/original tree block is good until we rebuild/read the whole
tree block.

> c) If this is a data stripe, then chances are it belongs to
>    nocow/nodatasum inodes such that no checksum could be verified. In
>    case a checksum is available, yes, finally we can verify it, but
>    keep in mind the whole data stripe may consist of several pieces of
>    data which are from different inodes, some may be
>    nodatacow/nodatasum, some may not.

Ignore such cases too.

> Given all the trouble it could lead to, I'd like to avoid it.

For the hard parts, I think keeping them untouched until a good
solution is found is completely acceptable.

No need to do it all-in-one; step by step should be the correct way.

> As a raid6 system, 2 disk failures can always be tolerated by
> rebuilding data from other good copies.
>
> Now that we know two things, i.e.
> a) data1 fails its checksum verification, and
> b) data2 from an out-of-sync disk might be stale,
> it's straightforward to rebuild data1 from the other good copies.

But the out-of-sync flag is too generic to determine whether we can
trust the copy. Csum here is a much better indicator, which can provide
a page-level (for data) indication rather than a device-level one.

For the case of 3 devices out-of-sync, do we just treat it as 3 missing
devices?

Thanks,
Qu
On 2017-11-10 08:54, Qu Wenruo wrote:
> On 2017-11-10 08:12, Liu Bo wrote:
>> [...]
>>
>> a) We may have multiple different raid profiles within one btrfs, and
>>    the checksum tree also resides somewhere on the disks; for raid1
>>    (the default) and raid5, two disk failures may end up with data
>>    loss, and it applies to raid6 as well because of the bug this
>>    patch is addressing.
>
> However, raid56 is more complex than the other profiles, so it even
> has its own raid56.[ch]. Although currently that is mostly for the
> complex RMW cycle, I think it can also be enhanced to do csum
> verification at repair time.

BTW, the repair can even be done without verifying the csums of
data2/data0. Just try every repair combination until the repair result
matches the checksum, or until no combination can produce a good
repair.

I know this sounds stupid, but at least it means it's possible.

The main point here is that we should at least ensure the repair result
matches the checksum.

This also reminds me that checksum verification should be handled at a
lower level, rather than doing it at endio time.

(Not related to this topic, but I also want to implement an optional
csum check hook in the bio, so that a case like dm-integrity on
dm-raid1 can recognize the bad and the good copy, and even let btrfs
reuse device mapper to do the chunk map, but that's another story.)

Thanks,
Qu
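A minimal sketch of the brute-force idea above, assuming the target is
a csum-protected data stripe: always treat the stripe that failed its
csum as bad, try each other stripe as the second failure, and accept
the first rebuild whose checksum matches. The helper names
(rebuild_assuming_failed(), csum_matches(), struct stripe_ctx) are
hypothetical, not btrfs APIs.

/*
 * Illustrative sketch only: brute-force the second "failed" slot until
 * the rebuilt target stripe matches its checksum.
 */
static int repair_by_trying_combinations(struct stripe_ctx *ctx,
					 int target, int nr_stripes)
{
	int other;

	/* first try a raid5-style rebuild: only the target is bad */
	rebuild_assuming_failed(ctx, target, -1);
	if (csum_matches(ctx, target))
		return 0;

	/* then assume one more stripe is stale and retry each candidate */
	for (other = 0; other < nr_stripes; other++) {
		if (other == target)
			continue;
		rebuild_assuming_failed(ctx, target, other);
		if (csum_matches(ctx, target))
			return 0;
	}

	return -EIO;	/* no combination produced a good repair */
}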
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 67262f8..3c0ce61 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1076,7 +1076,8 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio,
 	disk_start = stripe->physical + (page_index << PAGE_SHIFT);

 	/* if the device is missing, just fail this stripe */
-	if (!stripe->dev->bdev)
+	if (!stripe->dev->bdev ||
+	    !test_bit(In_sync, &stripe->dev->flags))
 		return fail_rbio_index(rbio, stripe_nr);

 	/* see if we can add this page onto our existing bio */
With raid6 profile, btrfs can end up with data corruption by the
following steps.

Say we have 5 disks that are set up with the raid6 profile,

1) mount this btrfs
2) one disk gets pulled out
3) write something to btrfs and sync
4) another disk gets pulled out
5) write something to btrfs and sync
6) umount this btrfs
7) bring the two disks back
8) reboot
9) mount this btrfs

Chances are mount will fail (even with -o degraded) because of failures
on reading metadata blocks; IOW, our raid6 setup is not able to protect
against two disk failures in some cases.

So it turns out that there is a bug in raid6's recovery code: if we
have one of the stripes in the raid6 layout shown here,

| D1(*) | D2(*) | D3    | D4    | D5    |
-----------------------------------------
| data1 | data2 | P     | Q     | data0 |

D1 and D2 are the two disks which got pulled out and brought back.
When mount reads data1 and finds out that data1 doesn't match its crc,
btrfs goes to recover data1 from the other stripes; what it'll be doing
is

1) read data2, parity P, parity Q and data0

2) as we have a valid parity P and two data stripes, it goes to
   recover in raid5 style.

   (However, since disk D2 was pulled out and data2 on D2 could be
   stale, we still get the wrong data1 from this reconstruction.)

3) btrfs continues to try to reconstruct data1 from parity Q, data2
   and data0, and we still get the wrong one for the same reason.

The fix here is to take advantage of the device flag, i.e. 'In_sync':
all data on a device might be stale if 'In_sync' has not been set.

With this, we can rebuild the correct data from parity P, Q and data0.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
 fs/btrfs/raid56.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)