Message ID: 1395002246-3840-1-git-send-email-fdmanana@gmail.com (mailing list archive)
State: Accepted
I just created this array:

polgara:/mnt/btrfs_backupcopy# btrfs fi show
Label: backupcopy  uuid: 7d8e1197-69e4-40d8-8d86-278d275af896
        Total devices 10 FS bytes used 220.32GiB
        devid    1 size 465.76GiB used 25.42GiB path /dev/dm-0
        devid    2 size 465.76GiB used 25.40GiB path /dev/dm-1
        devid    3 size 465.75GiB used 25.40GiB path /dev/mapper/crypt_sde1
        devid    4 size 465.76GiB used 25.40GiB path /dev/dm-3
        devid    5 size 465.76GiB used 25.40GiB path /dev/dm-4
        devid    6 size 465.76GiB used 25.40GiB path /dev/dm-5
        devid    7 size 465.76GiB used 25.40GiB path /dev/dm-6
        devid    8 size 465.76GiB used 25.40GiB path /dev/mapper/crypt_sdj1
        devid    9 size 465.76GiB used 25.40GiB path /dev/dm-9
        devid   10 size 465.76GiB used 25.40GiB path /dev/dm-8

Clearly it has issues with one of the drives, and a copy onto it is still running.

Last time I tried to boot a raid5 btrfs array with a drive missing, that didn't work at all.
Since this array is still running, what are my options?
- I can't tell btrfs to replace drive sde1 with a new drive I plugged in, because the code doesn't exist yet, correct?
- If I yank sde1 and reboot, the array will not come back up, from what I understand, or is that incorrect?
- Do rebuilds onto a spare drive work at all with a missing drive?

This is with 3.14.0-rc5. Do I have other options?
(The data is not important at all, I just want to learn how to deal with such a case with the current code.)

[59532.543415] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 162, rd 2444, flush 0, corrupt 0, gen 0
[59547.654888] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 162, rd 2445, flush 0, corrupt 0, gen 0
[59547.655755] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 162, rd 2446, flush 0, corrupt 0, gen 0
[59552.096038] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 162, rd 2447, flush 0, corrupt 0, gen 0
[59552.096613] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 162, rd 2448, flush 0, corrupt 0, gen 0
[59557.124736] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 162, rd 2449, flush 0, corrupt 0, gen 0
[59557.125569] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 162, rd 2450, flush 0, corrupt 0, gen 0
[59572.694548] BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
[59572.695757] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 163, rd 2450, flush 0, corrupt 0, gen 0
[59572.696295] BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
[59572.696976] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 164, rd 2450, flush 0, corrupt 0, gen 0
[59572.697693] BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
[59572.698397] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 165, rd 2450, flush 0, corrupt 0, gen 0
[59586.844083] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 165, rd 2451, flush 0, corrupt 0, gen 0
[59586.844614] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 165, rd 2452, flush 0, corrupt 0, gen 0
[59587.087696] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 165, rd 2453, flush 0, corrupt 0, gen 0
[59587.088378] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 165, rd 2454, flush 0, corrupt 0, gen 0
[59587.188784] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 165, rd 2455, flush 0, corrupt 0, gen 0
[59587.189280] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 165, rd 2456, flush 0, corrupt 0, gen 0
[59587.189737] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 165, rd 2457, flush 0, corrupt 0, gen 0
[59612.829235] BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
[59612.829871] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 166, rd 2457, flush 0, corrupt 0, gen 0
[59612.830767] BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
[59612.831397] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 167, rd 2457, flush 0, corrupt 0, gen 0
[59612.832220] BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
[59612.832848] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 168, rd 2457, flush 0, corrupt 0, gen 0
[59648.014743] BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
[59648.015221] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 169, rd 2457, flush 0, corrupt 0, gen 0
[59648.015694] BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
[59648.016154] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 170, rd 2457, flush 0, corrupt 0, gen 0
[59648.017249] BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1

By the way, I found this very amusing:

polgara:/mnt/btrfs_backupcopy# smartctl -i /dev/sde
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.14.0-rc5-amd64-i915-preempt-20140216c] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               /14:0:0:
Product:              0
User Capacity:        600,332,565,813,390,450 bytes [600 PB]
Logical block size:   774843950 bytes
Physical block size:  1549687900 bytes

I have a 600PB drive for sale, please make me offers :)

Thanks,
Marc
On Mar 16, 2014, at 4:20 PM, Marc MERLIN <marc@merlins.org> wrote:
> If I yank sde1 and reboot, the array will not come back up from what I understand,
> or is that incorrect?
> Do rebuilds work at all with a missing drive to a spare drive?

The part that isn't working well enough is faulty status. The drive keeps hanging around producing a lot of errors instead of getting booted.

btrfs replace start ought to still work, but if the faulty drive is fussy it might slow down the rebuild, or even prevent it.

The more conservative approach is to pull the drive. If you've previously tested this hardware setup to tolerate hot swap, you can give that a shot. Otherwise, to avoid instability and additional problems, unmount the file system first. Do the hot swap. Then mount it -o degraded. Then use btrfs replace start.

Chris Murphy
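For reference, the swap-then-rebuild sequence Chris describes would look roughly like the sketch below. This is a hedged illustration, not something tested in the thread; the device names and devid are placeholders, and on the 3.14-era kernels discussed here the final replace step refuses to run on raid5/6, as shown later in the thread.

umount /mnt/btrfs_backupcopy                 # avoid doing the swap under a live mount
# physically remove the faulty drive, attach the new one
mount -o degraded LABEL=backupcopy /mnt/btrfs_backupcopy
btrfs replace start <faulty-devid> /dev/mapper/crypt_new1 /mnt/btrfs_backupcopy
btrfs replace status /mnt/btrfs_backupcopy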
On Mar 16, 2014, at 4:55 PM, Chris Murphy <lists@colorremedies.com> wrote:
> Then use btrfs replace start.
Looks like in 3.14rc6 replace isn't yet supported. I get "dev_replace cannot yet handle RAID5/RAID6".
When I do:
btrfs device add <new> <mp>
The command hangs, no kernel messages.
Chris Murphy
On Sun, Mar 16, 2014 at 05:12:10PM -0600, Chris Murphy wrote:
>
> On Mar 16, 2014, at 4:55 PM, Chris Murphy <lists@colorremedies.com> wrote:
>
> > Then use btrfs replace start.
>
> Looks like in 3.14rc6 replace isn't yet supported. I get "dev_replace cannot yet handle RAID5/RAID6".
>
> When I do:
> btrfs device add <new> <mp>
>
> The command hangs, no kernel messages.

Ok, that's kind of what I thought.
So, for now, with raid5:
- btrfs seems to handle a drive not working
- you say I can mount with the drive missing in degraded mode (I haven't tried that, I will)
- but no matter how I remove the faulty drive, there is no rebuild on a new drive procedure that works yet

Correct?

Thanks,
Marc
On Mar 16, 2014, at 5:12 PM, Chris Murphy <lists@colorremedies.com> wrote:
>
> On Mar 16, 2014, at 4:55 PM, Chris Murphy <lists@colorremedies.com> wrote:
>
>> Then use btrfs replace start.
>
> Looks like in 3.14rc6 replace isn't yet supported. I get "dev_replace cannot yet handle RAID5/RAID6".
>
> When I do:
> btrfs device add <new> <mp>
>
> The command hangs, no kernel messages.

So even though the device add command hangs, another shell with btrfs fi show reports that it succeeded:

Label: none  uuid: d50b6c0f-518a-455f-9740-e29779649250
        Total devices 4 FS bytes used 5.70GiB
        devid    1 size 7.81GiB used 4.02GiB path /dev/sdb
        devid    2 size 7.81GiB used 3.01GiB path /dev/sdc
        devid    3 size 7.81GiB used 4.01GiB path
        devid    4 size 7.81GiB used 0.00 path /dev/sdd

Yet umount <mp> says the target is busy. ps reports the command status D+. And it doesn't cancel. So at the moment I'm stuck coming up with a work around.

Chris Murphy
On Mar 16, 2014, at 5:17 PM, Marc MERLIN <marc@merlins.org> wrote:
> - but no matter how I remove the faulty drive, there is no rebuild on a
> new drive procedure that works yet
>
> Correct?

I'm not sure. From what I've read we should be able to add a device to raid5/6, but I don't know if it's expected we can add a device to a degraded raid5/6. If the add device succeeded, then I ought to be able to remove the missing devid, and then do a balance which should cause reconstruction.

https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg30714.html

Chris Murphy
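Spelled out as commands, the add-then-rebuild path Chris is speculating about would look something like this sketch. It is hedged and explicitly untested here (device names are placeholders, and the thread later shows both the add and the delete failing on this kernel):

mount -o degraded LABEL=backupcopy /mnt/btrfs_backupcopy
btrfs device add /dev/mapper/crypt_new1 /mnt/btrfs_backupcopy   # bring the array back to full device count
btrfs device delete missing /mnt/btrfs_backupcopy               # drop the absent devid
btrfs balance start /mnt/btrfs_backupcopy                       # rewrite chunks across the surviving devices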
On Sun, Mar 16, 2014 at 4:17 PM, Marc MERLIN <marc@merlins.org> wrote:
> On Sun, Mar 16, 2014 at 05:12:10PM -0600, Chris Murphy wrote:
>>
>> On Mar 16, 2014, at 4:55 PM, Chris Murphy <lists@colorremedies.com> wrote:
>>
>> > Then use btrfs replace start.
>>
>> Looks like in 3.14rc6 replace isn't yet supported. I get "dev_replace cannot yet handle RAID5/RAID6".
>>
>> When I do:
>> btrfs device add <new> <mp>
>>
>> The command hangs, no kernel messages.
>
> Ok, that's kind of what I thought.
> So, for now, with raid5:
> - btrfs seems to handle a drive not working
> - you say I can mount with the drive missing in degraded mode (I haven't
> tried that, I will)
> - but no matter how I remove the faulty drive, there is no rebuild on a
> new drive procedure that works yet
>
> Correct?

There was a discussion a while back that suggested that a "balance" would read all blocks and write them out again and that would recover the data. I have no idea if that works or not. Only do this as a last resort once you have already considered all data lost forever.
On Sun, Mar 16, 2014 at 05:23:25PM -0600, Chris Murphy wrote:
>
> On Mar 16, 2014, at 5:17 PM, Marc MERLIN <marc@merlins.org> wrote:
>
> > - but no matter how I remove the faulty drive, there is no rebuild on a
> > new drive procedure that works yet
> >
> > Correct?
>
> I'm not sure. From what I've read we should be able to add a device to raid5/6, but I don't know if it's expected we can add a device to a degraded raid5/6. If the add device succeeded, then I ought to be able to remove the missing devid, and then do a balance which should cause reconstruction.
>
> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg30714.html

Thanks for the link, that's what I thought I read recently.

So, on 3.14, I can confirm:

polgara:/mnt/btrfs_backupcopy# btrfs replace start 3 /dev/sdm1 /mnt/btrfs_backupcopy
[68377.679233] BTRFS warning (device dm-9): dev_replace cannot yet handle RAID5/RAID6

polgara:/mnt/btrfs_backupcopy# btrfs device delete /dev/mapper/crypt_sde1 `pwd`
ERROR: error removing the device '/dev/mapper/crypt_sde1' - Invalid argument

and yet:

Mar 16 17:48:35 polgara kernel: [69285.032615] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 805, rd 4835, flush 0, corrupt 0, gen 0
Mar 16 17:48:35 polgara kernel: [69285.033791] BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
Mar 16 17:48:35 polgara kernel: [69285.034379] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 806, rd 4835, flush 0, corrupt 0, gen 0
Mar 16 17:48:35 polgara kernel: [69285.035361] BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1
Mar 16 17:48:35 polgara kernel: [69285.035943] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 807, rd 4835, flush 0, corrupt 0, gen 0

So from here, it sounds like I can try:
1) unmount the filesystem
2) hope that remounting it without that device will work
3) btrfs device add to recreate the missing drive.

Before I do #1 and get myself in a worse state than I am (working filesystem), does that sound correct?

(again, the data is irrelevant, I have a btrfs receive on it that has been running for hours and that I'd have to restart, but that's it).

Thanks,
Marc
On Mar 16, 2014, at 6:51 PM, Marc MERLIN <marc@merlins.org> wrote:
>
>
> polgara:/mnt/btrfs_backupcopy# btrfs device delete /dev/mapper/crypt_sde1 `pwd`
> ERROR: error removing the device '/dev/mapper/crypt_sde1' - Invalid argument

You didn't specify a mount point, is the reason for that error. But also, since you're already effectively degraded with 1 disk you can't remove a 2nd without causing array collapse. You have to add a new device first *and* you have to "rebuild" with balance. Then presumably we can remove the device. But I'm stuck adding so I can't test anything else.

>
> So from here, it sounds like I can try:
> 1) unmount the filesystem
> 2) hope that remounting it without that device will work
> 3) btrfs device add to recreate the missing drive.
>
> Before I do #1 and get myself in a worse state than I am (working
> filesystem), does that sound correct?
>
> (again, the data is irrelevant, I have a btrfs receive on it that has
> been running for hours and that I'd have to restart, but that's it).

Well at this point I'd leave it alone because at least for me, device add hangs that command and all other subsequent btrfs user space commands. So for all I know (untested) the whole volume will block on this device add and is effectively useless.

Chris Murphy
On Sun, Mar 16, 2014 at 07:06:23PM -0600, Chris Murphy wrote:
>
> On Mar 16, 2014, at 6:51 PM, Marc MERLIN <marc@merlins.org> wrote:
> >
> >
> > polgara:/mnt/btrfs_backupcopy# btrfs device delete /dev/mapper/crypt_sde1 `pwd`
> > ERROR: error removing the device '/dev/mapper/crypt_sde1' - Invalid argument
>
> You didn't specify a mount point, is the reason for that error. But also, since you're already effectively degraded with 1 disk you can't remove a 2nd without causing array collapse. You have to add a new device first *and* you have to "rebuild" with balance. Then presumably we can remove the device. But I'm stuck adding so I can't test anything else.

You missed the `pwd` :)

I'm trying to remove the drive that is causing issues, that doesn't make things worse, does it?
Does btrfs not know that device is the bad one even though it's spamming my logs continuously about it?

If I add a device, isn't it going to grow my raid to make it bigger instead of trying to replace the bad device?
In swraid5, if I add a device, it will grow the raid, unless the array is running in degraded mode.
However, I can't see if btrfs tools know it's in degraded mode or not.

If you are sure adding a device won't grow my raid, I'll give it a shot.

> > (again, the data is irrelevant, I have a btrfs receive on it that has
> > been running for hours and that I'd have to restart, but that's it).
>
> Well at this point I'd leave it alone because at least for me, device add hangs that command and all other subsequent btrfs user space commands. So for all I know (untested) the whole volume will block on this device add and is effectively useless.

Right. I was hoping that my kernel being slightly newer than yours, and maybe real devices, would help, but of course I don't know that.

I'll add the new device first after you confirm that there is no chance it'll try to grow the filesystem :)

Thanks,
Marc
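For comparison, the mdraid behavior Marc is referring to works roughly like the hedged sketch below (hypothetical array and device names, not taken from the thread): on a healthy array an added disk only enlarges it after an explicit grow, while on a degraded array the same add starts a rebuild onto the new disk.

# Healthy md raid5: the new disk becomes a spare; growing uses it for capacity.
mdadm --add /dev/md0 /dev/sdX1
mdadm --grow /dev/md0 --raid-devices=11

# Degraded md raid5: the same --add triggers a rebuild onto the new disk instead.
mdadm --add /dev/md0 /dev/sdX1
mdadm --detail /dev/md0      # shows the array state and rebuild progress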
On Mar 16, 2014, at 7:17 PM, Marc MERLIN <marc@merlins.org> wrote:
> On Sun, Mar 16, 2014 at 07:06:23PM -0600, Chris Murphy wrote:
>>
>> On Mar 16, 2014, at 6:51 PM, Marc MERLIN <marc@merlins.org> wrote:
>>>
>>>
>>> polgara:/mnt/btrfs_backupcopy# btrfs device delete /dev/mapper/crypt_sde1 `pwd`
>>> ERROR: error removing the device '/dev/mapper/crypt_sde1' - Invalid argument
>>
>> You didn't specify a mount point, is the reason for that error. But also, since you're already effectively degraded with 1 disk you can't remove a 2nd without causing array collapse. You have to add a new device first *and* you have to "rebuild" with balance. Then presumably we can remove the device. But I'm stuck adding so I can't test anything else.
>
> You missed the `pwd` :)

I just don't know what it means, it's not a reference to a mount point I'm familiar with.

> I'm trying to remove the drive that is causing issues, that doesn't make
> things worse, does it?

I don't think you can force a Btrfs volume to go degraded with a device delete command right now, just like there isn't a command to make it go missing or faulty, like md raid.

> Does btrfs not know that device is the bad one even though it's spamming my
> logs continuously about it?

With raid5, you're always at the minimum number of devices to be normally mounted. Removing one immediately makes it degraded, which I don't think it's going to permit. At least, I get an error when I do it even without a device giving me fits.

>
> If I add a device, isn't it going to grow my raid to make it bigger instead
> of trying to replace the bad device?

Yes if it's successful. No if it fails, which is the problem I'm having.

> In swraid5, if I add a device, it will grow the raid, unless the array is
> running in degraded mode.
> However, I can't see if btrfs tools know it's in degraded mode or not.

Only once the device is missing, apparently, and then mounted -o degraded.

>
> If you are sure adding a device won't grow my raid, I'll give it a shot.

No I'm not sure. And yes I suspect it will make it bigger. But so far a.) replace isn't supported yet; and b.) delete causes the volume to go below the minimum required for normal operation, which it won't allow; which leaves c.) add a device, but I'm getting a hang. So I'm stuck at this point.

>
>>> (again, the data is irrelevant, I have a btrfs receive on it that has
>>> been running for hours and that I'd have to restart, but that's it).
>>
>> Well at this point I'd leave it alone because at least for me, device add hangs that command and all other subsequent btrfs user space commands. So for all I know (untested) the whole volume will block on this device add and is effectively useless.
>
> Right. I was hoping that my kernel slightly newer than yours and maybe real
> devices would help, but of course I don't know that.
>
> I'll add the new device first after you confirm that there is no chance
> it'll try to grow the filesystem :)

I confirm nothing since I can't proceed with a device add.

Chris Murphy
On Sun, Mar 16, 2014 at 08:56:35PM -0600, Chris Murphy wrote:
> >>> polgara:/mnt/btrfs_backupcopy# btrfs device delete /dev/mapper/crypt_sde1 `pwd`
> >>> ERROR: error removing the device '/dev/mapper/crypt_sde1' - Invalid argument
> >>
> >> You didn't specify a mount point, is the reason for that error. But also, since you're already effectively degraded with 1 disk you can't remove a 2nd without causing array collapse. You have to add a new device first *and* you have to "rebuild" with balance. Then presumably we can remove the device. But I'm stuck adding so I can't test anything else.
> >
> > You missed the `pwd` :)
>
> I just don't know what it means, it's not a reference to a mount point I'm familiar with.

Try echo `pwd` and you'll understand :)

> > I'm trying to remove the drive that is causing issues, that doesn't make
> > things worse, does it?
>
> I don't think you can force a Btrfs volume to go degraded with a device delete command right now, just like there isn't a command to make it go missing or faulty, like md raid.

> > Does btrfs not know that device is the bad one even though it's spamming my
> > logs continuously about it?
>
> With raid5, you're always at the minimum number of devices to be normally mounted. Removing one immediately makes it degraded which I don't think it's going to permit. At least, I get an error when I do it even without a device giving me fits.

Ok, I understand that.

> > If I add a device, isn't it going to grow my raid to make it bigger instead
> > of trying to replace the bad device?
>
> Yes if it's successful. No if it fails which is the problem I'm having.

That's where I don't follow you.
You just agreed that it will grow my raid.
So right now it's 4.5TB with 10 drives, if I add one drive, it will grow to 5TB with 11 drives.
How does that help?
Why would btrfs allow me to remove the faulty device since it does not let you remove a device from a running raid? If I grow it to a bigger raid, it still won't let me remove the device, will it?

> > In swraid5, if I add a device, it will grow the raid, unless the array is
> > running in degraded mode.
> > However, I can't see if btrfs tools know it's in degraded mode or not.
>
> Only once the device is missing, apparently, and then mounted -o degraded.

Duly noted.
If you agree that adding an 11th drive to my array will not help, I'll unmount the filesystem, remount it in degraded mode with 9 drives, and try to add the new 11th drive.

> > If you are sure adding a device won't grow my raid, I'll give it a shot.
>
> No I'm not sure. And yes I suspect it will make it bigger. But so far a.) replace isn't supported yet; and b.) delete causes the volume to go below the minimum required for normal operation which it won't allow; which leaves c.) add a device but I'm getting a hang. So I'm stuck at this point.

Right. So I think we also agree that adding a device to the running filesystem is not what I want to do, since it'll grow it and do nothing to let me remove the faulty one.

> > I'll add the new device first after you confirm that there is no chance
> > it'll try to grow the filesystem :)
>
> I confirm nothing since I can't proceed with a device add.

Fair enough. So unless someone tells me otherwise, I will unmount the filesystem, remount it in degraded mode, and then try to add the 11th drive when the 10th one is missing.

Thanks,
Marc
On Mar 16, 2014, at 9:44 PM, Marc MERLIN <marc@merlins.org> wrote:
> On Sun, Mar 16, 2014 at 08:56:35PM -0600, Chris Murphy wrote:
>
>>> If I add a device, isn't it going to grow my raid to make it bigger instead
>>> of trying to replace the bad device?
>>
>> Yes if it's successful. No if it fails which is the problem I'm having.
>
> That's where I don't follow you.
> You just agreed that it will grow my raid.
> So right now it's 4.5TB with 10 drives, if I add one drive, it will grow to
> 5TB with 11 drives.
> How does that help?

If you swap the faulty drive for a good drive, I'm thinking then you'll be able to device delete the bad device, which ought to be "missing" at that point; or if that fails you should be able to do a balance, and then be able to device delete the faulty drive.

The problem I'm having is that when I detach one device out of a 3 device raid5, btrfs fi show doesn't list it as missing. It's listed without the /dev/sdd designation it had when attached, but now it's just blank.

> Why would btrfs allow me to remove the faulty device since it does not let
> you remove a device from a running raid? If I grow it to a bigger raid, it
> still won't let me remove the device, will it?

Maybe not, but it seems like it ought to let you balance, which should only be across available devices, at which point you should be able to device delete the bad one. That's assuming you've physically detached the faulty device from the start though.

>
>>> In swraid5, if I add a device, it will grow the raid, unless the array is
>>> running in degraded mode.
>>> However, I can't see if btrfs tools know it's in degraded mode or not.
>>
>> Only once the device is missing, apparently, and then mounted -o degraded.
>
> Duly noted.
> If you agree that adding an 11th drive to my array will not help, I'll
> unmount the filesystem, remount it in degraded mode with 9 drives and try to
> add the new 11th drive.

That's the only option I see at the moment in any case, other than blowing it all away and starting from scratch. What I don't know is whether you will be able to 'btrfs device delete' what ought to now be a missing device, since you have enough drives added to proceed with that deletion; or if you'll have to balance first. And I don't even know if the balance will work and then let you device delete, or if it's a dead end at this point.

>
>>> If you are sure adding a device won't grow my raid, I'll give it a shot.
>>
>> No I'm not sure. And yes I suspect it will make it bigger. But so far a.) replace isn't supported yet; and b.) delete causes the volume to go below the minimum required for normal operation which it won't allow; which leaves c.) add a device but I'm getting a hang. So I'm stuck at this point.
>
> Right. So I think we also agree that adding a device to the running
> filesystem is not what I want to do since it'll grow it and do nothing to
> let me remove the faulty one.

The grow is entirely beside the point. You definitely can't btrfs replace, or btrfs device delete, so what else is there but to try btrfs device add, or obliterate it and start over?

Chris Murphy
On Sun, Mar 16, 2014 at 11:12:43PM -0600, Chris Murphy wrote:
>
> On Mar 16, 2014, at 9:44 PM, Marc MERLIN <marc@merlins.org> wrote:
>
> > On Sun, Mar 16, 2014 at 08:56:35PM -0600, Chris Murphy wrote:
> >
> >>> If I add a device, isn't it going to grow my raid to make it bigger instead
> >>> of trying to replace the bad device?
> >>
> >> Yes if it's successful. No if it fails which is the problem I'm having.
> >
> > That's where I don't follow you.
> > You just agreed that it will grow my raid.
> > So right now it's 4.5TB with 10 drives, if I add one drive, it will grow to
> > 5TB with 11 drives.
> > How does that help?
>
> If you swap the faulty drive for a good drive, I'm thinking then you'll be able to device delete the bad device, which ought to be "missing" at that point; or if that fails you should be able to do a balance, and then be able to device delete the faulty drive.
>
> The problem I'm having is that when I detach one device out of a 3 device raid5, btrfs fi show doesn't list it as missing. It's listed without the /dev/sdd designation it had when attached, but now it's just blank.

Ok, I tried unmounting and remounting degraded this morning:

polgara:~# mount -v -t btrfs -o compress=zlib,space_cache,noatime,degraded LABEL=backupcopy /mnt/btrfs_backupcopy
Mar 17 08:57:35 polgara kernel: [123824.344085] BTRFS: device label backupcopy devid 9 transid 3837 /dev/mapper/crypt_sdk1
Mar 17 08:57:35 polgara kernel: [123824.454641] BTRFS info (device dm-9): allowing degraded mounts
Mar 17 08:57:35 polgara kernel: [123824.454978] BTRFS info (device dm-9): disk space caching is enabled
Mar 17 08:57:35 polgara kernel: [123824.497437] BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 3888, rd 321927975, flush 0, corrupt 0, gen 0
/dev/mapper/crypt_sdk1 on /mnt/btrfs_backupcopy type btrfs (rw,noatime,compress=zlib,space_cache,degraded)

What's confusing is that mounting in degraded mode shows all devices:

polgara:~# btrfs fi show
Label: backupcopy  uuid: 7d8e1197-69e4-40d8-8d86-278d275af896
        Total devices 10 FS bytes used 376.27GiB
        devid    1 size 465.76GiB used 42.42GiB path /dev/dm-0
        devid    2 size 465.76GiB used 42.40GiB path /dev/dm-1
        devid    3 size 465.75GiB used 42.40GiB path /dev/mapper/crypt_sde1   << this is missing
        devid    4 size 465.76GiB used 42.40GiB path /dev/dm-3
        devid    5 size 465.76GiB used 42.40GiB path /dev/dm-4
        devid    6 size 465.76GiB used 42.40GiB path /dev/dm-5
        devid    7 size 465.76GiB used 42.40GiB path /dev/dm-6
        devid    8 size 465.76GiB used 42.40GiB path /dev/mapper/crypt_sdj1
        devid    9 size 465.76GiB used 42.40GiB path /dev/mapper/crypt_sdk1
        devid   10 size 465.76GiB used 42.40GiB path /dev/dm-8

Ok, so mount in degraded mode works.
Adding a new device failed though: polgara:~# btrfs device add /dev/mapper/crypt_sdm1 /mnt/btrfs_backupcopy BTRFS: bad tree block start 852309604880683448 156237824 ------------[ cut here ]------------ WARNING: CPU: 0 PID: 1963 at fs/btrfs/super.c:257 __btrfs_abort_transaction+0x50/0x100() BTRFS: Transaction aborted (error -5) Modules linked in: xts gf128mul ipt_MASQUERADE ipt_REJECT xt_tcpudp xt_conntrack xt_LOG iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables cpufreq_userspace cpufreq_powersave cpufreq_conservative cpufreq_stats ppdev rfcomm bnep autofs4 binfmt_misc uinput nfsd auth_rpcgss nfs_acl nfs lockd fscache sunrpc fuse dm_crypt dm_mod configs parport_pc lp parport input_polldev loop firewire_sbp2 firewire_core crc_itu_t ecryptfs btusb bluetooth 6lowpan_iphc rfkill usbkbd usbmouse joydev hid_generic usbhid hid iTCO_wdt iTCO_vendor_support gpio_ich coretemp kvm_intel kvm microcode snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec pcspkr snd_hwdep i2c_i801 snd_pcm_oss snd_mixer_oss lpc_ich snd_pcm snd_seq_midi snd_seq_midi_event sg sr_mod cdrom snd_rawmidi snd_seq snd_seq_device snd_timer atl1 mii mvsas snd nouveau libsas scsi_transport_ soundcore ttm ehci_pci asus_atk0110 floppy uhci_hcd ehci_hcd usbcore acpi_cpufreq usb_common processor evdev CPU: 0 PID: 1963 Comm: btrfs Tainted: G W 3.14.0-rc5-amd64-i915-preempt-20140216c #1 Hardware name: System manufacturer P5KC/P5KC, BIOS 0502 05/24/2007 0000000000000000 ffff88004b5c9988 ffffffff816090b3 ffff88004b5c99d0 ffff88004b5c99c0 ffffffff81050025 ffffffff8120913a 00000000fffffffb ffff8800144d5800 ffff88007bd3ba00 ffffffff81839280 ffff88004b5c9a20 Call Trace: [<ffffffff816090b3>] dump_stack+0x4e/0x7a [<ffffffff81050025>] warn_slowpath_common+0x7f/0x98 [<ffffffff8120913a>] ? __btrfs_abort_transaction+0x50/0x100 [<ffffffff8105008a>] warn_slowpath_fmt+0x4c/0x4e [<ffffffff8120913a>] __btrfs_abort_transaction+0x50/0x100 [<ffffffff81216fed>] __btrfs_free_extent+0x6ce/0x712 [<ffffffff8121bc89>] __btrfs_run_delayed_refs+0x939/0xbdf [<ffffffff8121dac8>] btrfs_run_delayed_refs+0x81/0x18f [<ffffffff8122aeb2>] btrfs_commit_transaction+0xeb/0x849 [<ffffffff8124e777>] btrfs_init_new_device+0x9a1/0xc00 [<ffffffff8114069b>] ? ____cache_alloc+0x1c/0x29b [<ffffffff81129d3e>] ? mem_cgroup_end_update_page_stat+0x17/0x26 [<ffffffff8125570f>] ? btrfs_ioctl+0x989/0x24b1 [<ffffffff81141096>] ? __kmalloc_track_caller+0x130/0x144 [<ffffffff8125570f>] ? btrfs_ioctl+0x989/0x24b1 [<ffffffff81255730>] btrfs_ioctl+0x9aa/0x24b1 [<ffffffff81611e15>] ? __do_page_fault+0x330/0x3df [<ffffffff8116da43>] ? mntput_no_expire+0x33/0x12b [<ffffffff81163b16>] do_vfs_ioctl+0x3d2/0x41d [<ffffffff8115676b>] ? ____fput+0xe/0x10 [<ffffffff8106973a>] ? task_work_run+0x87/0x98 [<ffffffff81163bb8>] SyS_ioctl+0x57/0x82 [<ffffffff81611ed2>] ? 
do_page_fault+0xe/0x10 [<ffffffff816154ad>] system_call_fastpath+0x1a/0x1f ---[ end trace 7d08b9b7f2f17b38 ]--- BTRFS: error (device dm-9) in __btrfs_free_extent:5755: errno=-5 IO failure BTRFS info (device dm-9): forced readonly ERROR: error adding the device '/dev/mapper/crypt_sdm1' - Input/output error polgara:~# Mar 17 09:07:14 polgara kernel: [124403.240880] BTRFS: error (device dm-9) in btrfs_run_delayed_refs:2713: errno=-5 IO failure Mmmh, dm-9 is another device, although it seems to work: polgara:~# dd if=/dev/dm-9 of=/dev/null bs=1M ^C1255+0 records in 1254+0 records out 1314914304 bytes (1.3 GB) copied, 15.169 s, 86.7 MB/s polgara:~# btrfs device stats /dev/dm-9 [/dev/mapper/crypt_sdk1].write_io_errs 0 [/dev/mapper/crypt_sdk1].read_io_errs 0 [/dev/mapper/crypt_sdk1].flush_io_errs 0 [/dev/mapper/crypt_sdk1].corruption_errs 0 [/dev/mapper/crypt_sdk1].generation_errs 0 I also started getting errors on my device after hours of use last night (pasted below). Not sure if I really have a 2nd device problem or not: /dev/mapper/crypt_sde1 is dm-2, BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1 BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1 BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1 BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1 BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1 BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1 BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1 BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1 BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1 quiet_error: 123 callbacks suppressed Buffer I/O error on device dm-2, logical block 16 Buffer I/O error on device dm-2, logical block 16384 Buffer I/O error on device dm-2, logical block 67108864 Buffer I/O error on device dm-2, logical block 16 Buffer I/O error on device dm-2, logical block 16384 Buffer I/O error on device dm-2, logical block 67108864 BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1 BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1 BTRFS: lost page write due to I/O error on /dev/mapper/crypt_sde1 Buffer I/O error on device dm-2, logical block 0 Buffer I/O error on device dm-2, logical block 1 Buffer I/O error on device dm-2, logical block 2 Buffer I/O error on device dm-2, logical block 3 Buffer I/O error on device dm-2, logical block 0 Buffer I/O error on device dm-2, logical block 122095101 Buffer I/O error on device dm-2, logical block 122095101 Buffer I/O error on device dm-2, logical block 0 Buffer I/O error on device dm-2, logical block 0 btrfs_dev_stat_print_on_error: 366 callbacks suppressed btrfs_dev_stat_print_on_error: 346 callbacks suppressed btrfs_dev_stat_print_on_error: 606 callbacks suppressed btrfs_dev_stat_print_on_error: 276 callbacks suppressed BTRFS: bad tree block start 16817792799093053571 2701656064 BTRFS: bad tree block start 16817792799093053571 2701656064 BTRFS: bad tree block start 16817792799093053571 2701656064 BTRFS: bad tree block start 16817792799093053571 2701656064 BTRFS: bad tree block start 16817792799093053571 2701656064 BTRFS: bad tree block start 16817792799093053571 2701656064 BTRFS: bad tree block start 16817792799093053571 2701656064 BTRFS: bad tree block start 16817792799093053571 2701656064 BTRFS: bad tree block start 16817792799093053571 2701656064 BTRFS: bad tree block start 16817792799093053571 2701656064 btrfs_dev_stat_print_on_error: 11469 callbacks suppressed 
btree_readpage_end_io_hook: 31227 callbacks suppressed BTRFS: bad tree block start 16817792799093053571 2701656064 BTRFS: bad tree block start 16817792799093053571 2701656064 BTRFS: bad tree block start 16817792799093053571 2701656064 eventually it turned into: BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 3891, rd 321927996, flush 0, corrupt 0, gen 0 BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 3891, rd 321927997, flush 0, corrupt 0, gen 0 BTRFS: bad tree block start 17271740454546054736 1265680384 ------------[ cut here ]------------ WARNING: CPU: 1 PID: 10414 at fs/btrfs/super.c:257 __btrfs_abort_transaction+0x50/0x100() BTRFS: Transaction aborted (error -5) Modules linked in: xts gf128mul ipt_MASQUERADE ipt_REJECT xt_tcpudp xt_conntrack xt_LOG iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables cpufreq_userspace cpufreq_powersave cpufreq_conservative cpufreq_stats ppdev rfcomm bnep autofs4 binfmt_misc uinput nfsd auth_rpcgss nfs_acl nfs lockd fscache sunrpc fuse dm_crypt dm_mod configs parport_pc lp parport input_polldev loop firewire_sbp2 firewire_core crc_itu_t ecryptfs btusb bluetooth 6lowpan_iphc rfkill usbkbd usbmouse joydev hid_generic usbhid hid iTCO_wdt iTCO_vendor_support gpio_ich coretemp kvm_intel kvm microcode snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec pcspkr snd_hwdep i2c_i801 snd_pcm_oss snd_mixer_oss lpc_ich snd_pcm snd_seq_midi snd_seq_midi_event sg sr_mod cdrom snd_rawmidi snd_seq snd_seq_device snd_timer atl1 mii mvsas snd nouveau libsas scsi_transport_ soundcore ttm ehci_pci asus_atk0110 floppy uhci_hcd ehci_hcd usbcore acpi_cpufreq usb_common processor evdev CPU: 1 PID: 10414 Comm: btrfs-transacti Not tainted 3.14.0-rc5-amd64-i915-preempt-20140216c #1 Hardware name: System manufacturer P5KC/P5KC, BIOS 0502 05/24/2007 0000000000000000 ffff88004ae4fb30 ffffffff816090b3 ffff88004ae4fb78 ffff88004ae4fb68 ffffffff81050025 ffffffff8120913a 00000000fffffffb ffff88004f2e7800 ffff8800603804c0 ffffffff81839280 ffff88004ae4fbc8 Call Trace: [<ffffffff816090b3>] dump_stack+0x4e/0x7a [<ffffffff81050025>] warn_slowpath_common+0x7f/0x98 [<ffffffff8120913a>] ? __btrfs_abort_transaction+0x50/0x100 [<ffffffff8105008a>] warn_slowpath_fmt+0x4c/0x4e [<ffffffff8120913a>] __btrfs_abort_transaction+0x50/0x100 [<ffffffff81216fed>] __btrfs_free_extent+0x6ce/0x712 [<ffffffff8121bc89>] __btrfs_run_delayed_refs+0x939/0xbdf [<ffffffff8121dac8>] btrfs_run_delayed_refs+0x81/0x18f [<ffffffff8122ae40>] btrfs_commit_transaction+0x79/0x849 [<ffffffff812277ca>] transaction_kthread+0xf8/0x1ab [<ffffffff812276d2>] ? btrfs_cleanup_transaction+0x43f/0x43f [<ffffffff8106bc56>] kthread+0xae/0xb6 [<ffffffff8106bba8>] ? __kthread_parkme+0x61/0x61 [<ffffffff816153fc>] ret_from_fork+0x7c/0xb0 [<ffffffff8106bba8>] ? __kthread_parkme+0x61/0x61 ---[ end trace 7d08b9b7f2f17b35 ]--- BTRFS: error (device dm-9) in __btrfs_free_extent:5755: errno=-5 IO failure BTRFS info (device dm-9): forced readonly BTRFS: error (device dm-9) in btrfs_run_delayed_refs:2713: errno=-5 IO failure ------------[ cut here ]------------
On Mar 17, 2014, at 10:13 AM, Marc MERLIN <marc@merlins.org> wrote:
>
> What's confusing is that mounting in degraded mode shows all devices:
> polgara:~# btrfs fi show
> Label: backupcopy  uuid: 7d8e1197-69e4-40d8-8d86-278d275af896
>         Total devices 10 FS bytes used 376.27GiB
>         devid    1 size 465.76GiB used 42.42GiB path /dev/dm-0
>         devid    2 size 465.76GiB used 42.40GiB path /dev/dm-1
>         devid    3 size 465.75GiB used 42.40GiB path /dev/mapper/crypt_sde1   << this is missing
>         devid    4 size 465.76GiB used 42.40GiB path /dev/dm-3
>         devid    5 size 465.76GiB used 42.40GiB path /dev/dm-4
>         devid    6 size 465.76GiB used 42.40GiB path /dev/dm-5
>         devid    7 size 465.76GiB used 42.40GiB path /dev/dm-6
>         devid    8 size 465.76GiB used 42.40GiB path /dev/mapper/crypt_sdj1
>         devid    9 size 465.76GiB used 42.40GiB path /dev/mapper/crypt_sdk1
>         devid   10 size 465.76GiB used 42.40GiB path /dev/dm-8

Is /dev/mapper/crypt_sde1 completely unavailable, as in not listed by lsblk? If it's not connected and not listed by lsblk, yet it's listed by btrfs fi show, that's a bug.

>
> eventually it turned into:
> BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 3891, rd 321927996, flush 0, corrupt 0, gen 0
> BTRFS: bdev /dev/mapper/crypt_sde1 errs: wr 3891, rd 321927997, flush 0, corrupt 0, gen 0
[snip]
> BTRFS: error (device dm-9) in __btrfs_free_extent:5755: errno=-5 IO failure
> BTRFS info (device dm-9): forced readonly
> BTRFS: error (device dm-9) in btrfs_run_delayed_refs:2713: errno=-5 IO failure

I think it's a lost cause at this point. Your setup is substantially more complicated than my simple setup, and I can't even get the simple setup to recover from an idealized single device raid5 failure. The only apparent way out is to mount degraded, back up, and then start over.

In your case it looks like at least two devices are reporting, or Btrfs thinks they're reporting, I/O errors. Whether this is the physical drive itself, or some other layer, I can't tell (it looks like these are dmcrypt logical block devices).

Chris Murphy
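Spelled out, the bail-out path Chris suggests (mount degraded, back up, start over) would look something like this hedged sketch. The destination path is a placeholder, the rsync flags are just one reasonable way to do the copy, and whether the degraded mount stays usable long enough is exactly what's in question here:

mount -o degraded,ro LABEL=backupcopy /mnt/btrfs_backupcopy    # read-only to avoid further damage
rsync -aHAX /mnt/btrfs_backupcopy/ /mnt/somewhere_safe/        # or btrfs send for read-only snapshots
umount /mnt/btrfs_backupcopy
mkfs.btrfs -f -d raid5 -m raid5 -L backupcopy /dev/mapper/crypt_sd[bdfghijkl]1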
Marc MERLIN posted on Sun, 16 Mar 2014 15:20:26 -0700 as excerpted:

> Do I have other options?
> (data is not important at all, I just want to learn how to deal with
> such a case with the current code)

First just a note that you hijacked Mr Manana's patch thread. Replying to a post and changing the topic (the usual cause of such hijacks) does NOT change the thread, as the References and In-Reply-To headers still include the Message-IDs from the original thread, and that's what good clients thread by, since the subject line isn't a reliable means of threading. To start a NEW thread, don't reply to an existing thread; compose a NEW message, starting a NEW thread. =:^)

Back on topic...

Since you don't have to worry about the data I'd suggest blowing it away and starting over. Btrfs raid5/6 code is known to be incomplete at this point, to work in normal mode and write everything out, but with incomplete recovery code. So I'd treat it like the raid-0 mode it effectively is, and consider it lost if a device drops.

There *IS* a post from an earlier thread where someone mentioned a recovery under some specific circumstance that worked for him, but I'd consider that the exception, not the norm, since the code is known to be incomplete, and I think he just got lucky and didn't hit the particular missing code in his specific case. Certainly you could try to go back and see what he did and under what conditions, and that might actually be worth doing if you had valuable data you'd be losing otherwise, but since you don't, while of course it's up to you, I'd not bother were it me.

Which I haven't. My use-case wouldn't be looking at raid5/6 (or raid0) anyway, but even if it were, I'd not touch the current code unless it /was/ just for something I'd consider risking on a raid0. Other than pure testing, the /only/ case I'd consider btrfs raid5/6 for right now would be something that I'd consider raid0-riskable currently, but with the bonus of it upgrading "for free" to raid5/6 when the code is complete, without any further effort on my part, since it's actually being written as raid5/6 ATM; the recovery simply can't be relied upon as raid5/6, so in recovery terms you're effectively running raid0 until it can be. Other than that, and for /pure/ testing, I just don't see the point of even thinking about raid5/6 at this point.
On Tue, Mar 18, 2014 at 09:02:07AM +0000, Duncan wrote:
> First just a note that you hijacked Mr Manana's patch thread. Replying
(...)

I did. I use mutt, I know about In-Reply-To, I was tired, I screwed up, sorry, and there was no undo :)

> Since you don't have to worry about the data I'd suggest blowing it away
> and starting over. Btrfs raid5/6 code is known to be incomplete at this
> point, to work in normal mode and write everything out, but with
> incomplete recovery code. So I'd treat it like the raid-0 mode it
> effectively is, and consider it lost if a device drops.
>
> Which I haven't. My use-case wouldn't be looking at raid5/6 (or raid0)
> anyway, but even if it were, I'd not touch the current code unless it
> /was/ just for something I'd consider risking on a raid0. Other than

Thank you for the warning, and yes, I know the risk; the data I'm putting on it is ok with that risk :)

I was a bit quiet because I diagnosed problems with the underlying hardware: my disk array was creating disk faults due to insufficient power coming in. Now that I fixed that and made sure the drives work, with a full run of hdrecover on all the drives in parallel (exercises the drives while making sure all their blocks work), I did tests again.

Summary:
1) You can grow and shrink a raid5 volume while it's mounted => very cool
2) shrinking causes a rebalance
3) growing requires you to run rebalance
4) btrfs cannot replace a drive in raid5, whether it's there or not; that's the biggest thing missing: just no rebuilds in any way
5) you can mount a raid5 with a missing device with -o degraded
6) adding a drive to a degraded array will grow the array, not rebuild the missing bits
7) you can remove a drive from an array, add files, and then if you plug the drive back in, it apparently gets auto sucked back into the array. There is no rebuild that happens; you now have an inconsistent array where one drive is not at the same level as the other ones (I lost all files I added after the drive was removed when I added the drive back).

In other words, everything seems to work, except there is no rebuild that I could see anywhere.

Here are all the details:

Creation:
> polgara:/dev/disk/by-id# mkfs.btrfs -f -d raid5 -m raid5 -L backupcopy /dev/mapper/crypt_sd[bdfghijkl]1
>
> WARNING! - Btrfs v3.12 IS EXPERIMENTAL
> WARNING! - see http://btrfs.wiki.kernel.org before using
>
> Turning ON incompat feature 'extref': increased hardlink limit per file to 65536
> Turning ON incompat feature 'raid56': raid56 extended format
> adding device /dev/mapper/crypt_sdd1 id 2
> adding device /dev/mapper/crypt_sdf1 id 3
> adding device /dev/mapper/crypt_sdg1 id 4
> adding device /dev/mapper/crypt_sdh1 id 5
> adding device /dev/mapper/crypt_sdi1 id 6
> adding device /dev/mapper/crypt_sdj1 id 7
> adding device /dev/mapper/crypt_sdk1 id 8
> adding device /dev/mapper/crypt_sdl1 id 9
> fs created label backupcopy on /dev/mapper/crypt_sdb1
>         nodesize 16384 leafsize 16384 sectorsize 4096 size 4.09TiB
> polgara:/dev/disk/by-id# mount -L backupcopy /mnt/btrfs_backupcopy/
> polgara:/mnt/btrfs_backupcopy# df -h .
> Filesystem              Size  Used Avail Use% Mounted on
> /dev/mapper/crypt_sdb1  4.1T  3.0M  4.1T   1% /mnt/btrfs_backupcopy

Let's add one drive:
> polgara:/mnt/btrfs_backupcopy# btrfs device add -f /dev/mapper/crypt_sdm1 /mnt/btrfs_backupcopy/
> polgara:/mnt/btrfs_backupcopy# df -h .
> Filesystem              Size  Used Avail Use% Mounted on
> /dev/mapper/crypt_sdb1  4.6T  3.0M  4.6T   1% /mnt/btrfs_backupcopy

Oh look, it's bigger now.
We need to manual rebalance to use the new drive: > polgara:/mnt/btrfs_backupcopy# btrfs balance start . > Done, had to relocate 6 out of 6 chunks > > polgara:/mnt/btrfs_backupcopy# btrfs device delete /dev/mapper/crypt_sdm1 . > BTRFS info (device dm-9): relocating block group 23314563072 flags 130 > BTRFS info (device dm-9): relocating block group 22106603520 flags 132 > BTRFS info (device dm-9): found 6 extents > BTRFS info (device dm-9): relocating block group 12442927104 flags 129 > BTRFS info (device dm-9): found 1 extents > polgara:/mnt/btrfs_backupcopy# df -h . > Filesystem Size Used Avail Use% Mounted on > /dev/mapper/crypt_sdb1 4.1T 4.7M 4.1T 1% /mnt/btrfs_backupcopy Ah, it's smaller again. Note that it's not degraded, you can just keep removing drives and it'll do a force reblance to fit the data in the remaining drives. Ok, I've unounted the filesystem, and will manually remove a device: > polgara:~# dmsetup remove crypt_sdl1 > polgara:~# mount -L backupcopy /mnt/btrfs_backupcopy/ > mount: wrong fs type, bad option, bad superblock on /dev/mapper/crypt_sdk1, > missing codepage or helper program, or other error > In some cases useful info is found in syslog - try > dmesg | tail or so > BTRFS: open /dev/dm-9 failed > BTRFS info (device dm-7): disk space caching is enabled > BTRFS: failed to read chunk tree on dm-7 > BTRFS: open_ctree failed So a normal mount fails. You have to mount with -o degraded to acknowledge this. > polgara:~# mount -o degraded -L backupcopy /mnt/btrfs_backupcopy/ > BTRFS: device label backupcopy devid 8 transid 50 /dev/mapper/crypt_sdk1 > BTRFS: open /dev/dm-9 failed > BTRFS info (device dm-7): allowing degraded mounts > BTRFS info (device dm-7): disk space caching is enabled Re-adding a device that was missing: > polgara:/mnt/btrfs_backupcopy# cryptsetup luksOpen /dev/sdl1 crypt_sdl1 > Enter passphrase for /dev/sdl1: > polgara:/mnt/btrfs_backupcopy# df -h . > Filesystem Size Used Avail Use% Mounted on > /dev/mapper/crypt_sdb1 4.1T 2.5M 3.7T 1% /mnt/btrfs_backupcopy > polgara:/mnt/btrfs_backupcopy# btrfs device add -f /dev/mapper/crypt_sdl1 /mnt/btrfs_backupcopy/ > /dev/mapper/crypt_sdl1 is mounted > BTRFS: device label backupcopy devid 9 transid 50 /dev/dm-9 > BTRFS: device label backupcopy devid 9 transid 50 /dev/dm-9 => waoh, btrfs noticed that the device came back and knew it was its own, so it slurped it right away (I was not able to add the device because it already was auto-added) Adding another device does grow the size which adding sdl1 did not: > polgara:/mnt/btrfs_backupcopy# btrfs device add -f /dev/mapper/crypt_sdm1 /mnt/btrfs_backupcopy/ > polgara:/mnt/btrfs_backupcopy# df -h . > Filesystem Size Used Avail Use% Mounted on > /dev/mapper/crypt_sdb1 4.6T 2.5M 4.1T 1% /mnt/btrfs_backupcopy Ok, harder, let's pull a drive now. Strangely btrfs doesn't notice right away but logs this eventually: BTRFS: bdev /dev/dm-6 errs: wr 0, rd 0, flush 1, corrupt 0, gen 0 BTRFS: lost page write due to I/O error on /dev/dm-6 BTRFS: bdev /dev/dm-6 errs: wr 1, rd 0, flush 1, corrupt 0, gen 0 BTRFS: lost page write due to I/O error on /dev/dm-6 BTRFS: bdev /dev/dm-6 errs: wr 2, rd 0, flush 1, corrupt 0, gen 0 BTRFS: lost page write due to I/O error on /dev/dm-6 BTRFS: bdev /dev/dm-6 errs: wr 3, rd 0, flush 1, corrupt 0, gen 0 From what I can tell, it buffers the writes to the missing drive and retries them in the background. Technically it is in degraded mode, but it doesn't seem to think so. 
This is where it now fails, I cannot remove the bad drive from the array: polgara:/mnt/btrfs_backupcopy# btrfs device delete /dev/mapper/crypt_sdj1 . ERROR: error removing the device '/dev/mapper/crypt_sdj1' - Invalid argument Drive replace is not yet implemented: > polgara:/mnt/btrfs_backupcopy# btrfs replace start -r /dev/mapper/crypt_sdj1 /dev/mapper/crypt_sde1 . > quiet_error: 138 callbacks suppressed > Buffer I/O error on device dm-6, logical block 122095344 > Buffer I/O error on device dm-6, logical block 122095364 > Buffer I/O error on device dm-6, logical block 0 > Buffer I/O error on device dm-6, logical block 1 > Buffer I/O error on device dm-6, logical block 122095365 > Buffer I/O error on device dm-6, logical block 122095365 > Buffer I/O error on device dm-6, logical block 122095365 > Buffer I/O error on device dm-6, logical block 122095365 > Buffer I/O error on device dm-6, logical block 122095365 > Buffer I/O error on device dm-6, logical block 122095365 > BTRFS warning (device dm-8): dev_replace cannot yet handle RAID5/RAID6 Adding a device at this point will not help because the filesystem is not in degraded mode, btrfs is still kind of hoping that dm-6 (aka crypt_sdj1) will come back. So if I add a device, it would just grow the raid. Let mount the array in degraded mode: > polgara:~# mount -v -t btrfs -o compress=zlib,space_cache,noatime,degraded LABEL=backupcopy /mnt/btrfs_backupcopy > polgara:~# btrfs fi show > Label: backupcopy uuid: 5ccda389-748b-419c-bfa9-c14c4136e1c4 > Total devices 10 FS bytes used 680.05MiB > devid 1 size 465.76GiB used 1.14GiB path /dev/mapper/crypt_sdb1 > devid 2 size 465.76GiB used 1.14GiB path /dev/dm-1 > devid 3 size 465.75GiB used 1.14GiB path /dev/dm-2 > devid 4 size 465.76GiB used 1.14GiB path /dev/dm-3 > devid 5 size 465.76GiB used 1.14GiB path /dev/dm-4 > devid 6 size 465.76GiB used 1.14GiB path /dev/dm-5 > devid 7 size 465.76GiB used 1.14GiB path /dev/dm-6 > devid 8 size 465.76GiB used 1.14GiB path /dev/mapper/crypt_sdk1 > devid 9 size 465.76GiB used 1.14GiB path /dev/mapper/crypt_sdl1 > devid 10 size 465.76GiB used 1.14GiB path /dev/mapper/crypt_sdm1 > > quiet_error: 250 callbacks suppressed > Buffer I/O error on device dm-6, logical block 122095344 > Buffer I/O error on device dm-6, logical block 122095344 > Buffer I/O error on device dm-6, logical block 122095364 > Buffer I/O error on device dm-6, logical block 122095364 > Buffer I/O error on device dm-6, logical block 0 > Buffer I/O error on device dm-6, logical block 0 > Buffer I/O error on device dm-6, logical block 1 > Buffer I/O error on device dm-6, logical block 122095365 > Buffer I/O error on device dm-6, logical block 122095365 > Buffer I/O error on device dm-6, logical block 122095365 Even though it cannot access dm-6, it still included it in the mount because the device node still exists. Adding a device does not help, it just grew the array in degraded mode: polgara:/mnt/btrfs_backupcopy# btrfs device add /dev/mapper/crypt_sde1 . polgara:/mnt/btrfs_backupcopy# df -h . Filesystem Size Used Avail Use% Mounted on /dev/mapper/crypt_sdb1 5.1T 681M 4.6T 1% /mnt/btrfs_backupcopy Balance is not happy: polgara:/mnt/btrfs_backupcopy# btrfs balance start . 
> BTRFS info (device dm-8): relocating block group 63026233344 flags 129 > BTRFS info (device dm-8): csum failed ino 257 off 917504 csum 1017609526 expected csum 4264281942 > BTRFS info (device dm-8): csum failed ino 257 off 966656 csum 389256117 expected csum 2901202041 > BTRFS info (device dm-8): csum failed ino 257 off 970752 csum 4107355973 expected csum 3954832285 > BTRFS info (device dm-8): csum failed ino 257 off 974848 csum 1121660380 expected csum 2872112983 > BTRFS info (device dm-8): csum failed ino 257 off 978944 csum 2032023730 expected csum 2250478230 > BTRFS info (device dm-8): csum failed ino 257 off 933888 csum 297434258 expected csum 3687027701 > BTRFS info (device dm-8): csum failed ino 257 off 937984 csum 1176910550 expected csum 3400460732 > BTRFS info (device dm-8): csum failed ino 257 off 942080 csum 366743485 expected csum 2321497660 > BTRFS info (device dm-8): csum failed ino 257 off 946176 csum 1849642521 expected csum 931611495 > BTRFS info (device dm-8): csum failed ino 257 off 921600 csum 1075941372 expected csum 2126420528 ERROR: error during balancing '.' - Input/output error This looks bad, but my filesystem didn't look corrupted after that. I am not allowed to remove the new device I just added: polgara:~# btrfs device delete /dev/mapper/crypt_sde1 . ERROR: error removing the device '/dev/mapper/crypt_sde1' - Inappropriate ioctl for device Let's now remove the device node of that bad drive, unmount and remount the array: polgara:~# dmsetup remove crypt_sdj1 polgara:~# btrfs fi show Label: 'backupcopy' uuid: 5ccda389-748b-419c-bfa9-c14c4136e1c4 Total devices 11 FS bytes used 682.30MiB devid 1 size 465.76GiB used 2.14GiB path /dev/mapper/crypt_sdb1 devid 2 size 465.76GiB used 2.14GiB path /dev/mapper/crypt_sdd1 devid 3 size 465.75GiB used 2.14GiB path /dev/mapper/crypt_sdf1 devid 4 size 465.76GiB used 2.14GiB path /dev/mapper/crypt_sdg1 devid 5 size 465.76GiB used 2.14GiB path /dev/mapper/crypt_sdh1 devid 6 size 465.76GiB used 2.14GiB path /dev/mapper/crypt_sdi1 devid 8 size 465.76GiB used 2.14GiB path /dev/mapper/crypt_sdk1 devid 9 size 465.76GiB used 2.14GiB path /dev/mapper/crypt_sdl1 devid 10 size 465.76GiB used 2.14GiB path /dev/mapper/crypt_sdm1 devid 11 size 465.76GiB used 1.00GiB path /dev/mapper/crypt_sde1 *** Some devices missing => ok, that's good, one device is missing Now when I mount the array, I see this: polgara:~# mount -v -t btrfs -o compress=zlib,space_cache,noatime,degraded LABEL=backupcopy /mnt/btrfs_backupcopy > BTRFS: device label backupcopy devid 11 transid 150 /dev/mapper/crypt_sde1 > BTRFS: open /dev/dm-6 failed > BTRFS info (device dm-10): allowing degraded mounts > BTRFS info (device dm-10): disk space caching is enabled > BTRFS: bdev /dev/dm-6 errs: wr 12, rd 0, flush 4, corrupt 0, gen 0 /dev/mapper/crypt_sde1 on /mnt/btrfs_backupcopy type btrfs (rw,noatime,compress=zlib,space_cache,degraded) polgara:~# btrfs fi show Label: backupcopy uuid: 5ccda389-748b-419c-bfa9-c14c4136e1c4 Total devices 11 FS bytes used 682.30MiB devid 1 size 465.76GiB used 2.14GiB path /dev/dm-0 devid 2 size 465.76GiB used 2.14GiB path /dev/dm-1 devid 3 size 465.75GiB used 2.14GiB path /dev/dm-2 devid 4 size 465.76GiB used 2.14GiB path /dev/dm-3 devid 5 size 465.76GiB used 2.14GiB path /dev/dm-4 devid 6 size 465.76GiB used 2.14GiB path /dev/dm-5 devid 7 size 465.76GiB used 1.14GiB path /dev/dm-6 devid 8 size 465.76GiB used 2.14GiB path /dev/dm-7 devid 9 size 465.76GiB used 2.14GiB path /dev/dm-9 devid 10 size 465.76GiB used 2.14GiB path /dev/dm-8 devid 11 
size 465.76GiB used 1.00GiB path /dev/mapper/crypt_sde1 That's bad, it still shows me dm-6 even though it's gone now. I think this means that you cannot get btrfs to show that it's in degraded mode. Ok, let's re-add the device: polgara:/mnt/btrfs_backupcopy# cryptsetup luksOpen /dev/sdj1 crypt_sdj1 Enter passphrase for /dev/sdj1: > BTRFS: device label backupcopy devid 7 transid 137 /dev/dm-6 polgara:/mnt/btrfs_backupcopy# Mar 18 22:30:55 polgara kernel: [49535.076071] BTRFS: device label backupcopy devid 7 transid 137 /dev/dm-6 > btrfs-rmw-2: page allocation failure: order:1, mode:0x8020 > CPU: 0 PID: 7511 Comm: btrfs-rmw-2 Tainted: G W 3.14.0-rc5-amd64-i915-preempt-20140216c #1 > Hardware name: System manufacturer P5KC/P5KC, BIOS 0502 05/24/2007 > 0000000000000000 ffff880011173690 ffffffff816090b3 0000000000000000 > ffff880011173718 ffffffff811037b0 00000001fffffffe 0000000000000001 > ffff88006bb2a0d0 0000000200000000 0000003000000000 ffff88007ff7ce00 > Call Trace: > [<ffffffff816090b3>] dump_stack+0x4e/0x7a > [<ffffffff811037b0>] warn_alloc_failed+0x111/0x125 > [<ffffffff81106cb2>] __alloc_pages_nodemask+0x707/0x854 > [<ffffffff8110654e>] ? get_page_from_freelist+0x6c0/0x71d > [<ffffffff81014650>] dma_generic_alloc_coherent+0xa7/0x11c > [<ffffffff811354e8>] dma_pool_alloc+0x10a/0x1cb > [<ffffffffa00f2aa0>] mvs_task_prep+0x192/0xa42 [mvsas] > [<ffffffff81140d66>] ? ____cache_alloc_node+0xf1/0x134 > [<ffffffffa00f33ad>] mvs_task_exec.isra.9+0x5d/0xc9 [mvsas] > [<ffffffffa00f3a76>] mvs_queue_command+0x3d/0x29b [mvsas] > [<ffffffff8114118d>] ? kmem_cache_alloc+0xe3/0x161 > [<ffffffffa00e5d1c>] sas_ata_qc_issue+0x1cd/0x235 [libsas] > [<ffffffff814a9598>] ata_qc_issue+0x291/0x2f1 > [<ffffffff814af413>] ? ata_scsiop_mode_sense+0x29c/0x29c > [<ffffffff814b049e>] __ata_scsi_queuecmd+0x184/0x1e0 > [<ffffffff814b05a5>] ata_sas_queuecmd+0x31/0x4d > [<ffffffffa00e47ba>] sas_queuecommand+0x98/0x1fe [libsas] > [<ffffffff8148fdee>] scsi_dispatch_cmd+0x14f/0x22e > [<ffffffff814964da>] scsi_request_fn+0x4da/0x507 > [<ffffffff812e01a3>] __blk_run_queue_uncond+0x22/0x2b > [<ffffffff812e01c5>] __blk_run_queue+0x19/0x1b > [<ffffffff812fc16d>] cfq_insert_request+0x391/0x3b5 > [<ffffffff812e002f>] ? perf_trace_block_rq_with_error+0x45/0x14f > [<ffffffff812e512c>] ? blk_recount_segments+0x1e/0x2e > [<ffffffff812dc08c>] __elv_add_request+0x1fc/0x276 > [<ffffffff812e1c6c>] blk_queue_bio+0x237/0x256 > [<ffffffff812df92c>] generic_make_request+0x9c/0xdb > [<ffffffff812dfa7d>] submit_bio+0x112/0x131 > [<ffffffff8128274c>] rmw_work+0x112/0x162 > [<ffffffff8125073f>] worker_loop+0x168/0x4d8 > [<ffffffff812505d7>] ? btrfs_queue_worker+0x283/0x283 > [<ffffffff8106bc56>] kthread+0xae/0xb6 > [<ffffffff8106bba8>] ? __kthread_parkme+0x61/0x61 > [<ffffffff816153fc>] ret_from_fork+0x7c/0xb0 > [<ffffffff8106bba8>] ? __kthread_parkme+0x61/0x61 My system hung soon after that, but it could have been due to issues with my SATA driver too. I rebooted, tried a mount: polgara:~# mount -v -t btrfs -o compress=zlib,space_cache,noatime LABEL=backupcopy /mnt/btrfs_backupcopy > BTRFS: device label backupcopy devid 11 transid 152 /dev/mapper/crypt_sde1 > BTRFS info (device dm-10): disk space caching is enabled > BTRFS: bdev /dev/dm-6 errs: wr 12, rd 0, flush 4, corrupt 0, gen 0 /dev/mapper/crypt_sde1 on /mnt/btrfs_backupcopy type btrfs (rw,noatime,compress=zlib,space_cache) Ok, there is a problem here, my filesystem is missing data I added after my sdj1 device died. 
In other words, btrfs happily added my device that was way behind and gave me an incomplete filesystem instead of noticing that sdj1 was behind and giving me a degraded filesystem. Moral of the story: do not ever re-add a device that got kicked out if you wrote new data after that, or you will end up with an older version of your filesystem (on the plus side, it's consistent and apparently without data corruption). That said, btrfs scrub complained loudly of many errors it didn't know how to fix. > BTRFS: bdev /dev/dm-6 errs: wr 12, rd 0, flush 4, corrupt 0, gen 0 > BTRFS: bad tree block start 6438453874765710835 61874388992 > BTRFS: bad tree block start 8828340560360071357 61886726144 > BTRFS: bad tree block start 5332618200988957279 61895868416 > BTRFS: bad tree block start 9233018093866324599 61895884800 > BTRFS: bad tree block start 17393001018657664843 61895917568 > BTRFS: bad tree block start 6438453874765710835 61874388992 > BTRFS: bad tree block start 8828340560360071357 61886726144 > BTRFS: bad tree block start 5332618200988957279 61895868416 > BTRFS: bad tree block start 9233018093866324599 61895884800 > BTRFS: bad tree block start 17393001018657664843 61895917568 > BTRFS: checksum error at logical 61826662400 on dev /dev/dm-6, sector 2541568: metadata leaf (level 0) in tree 5 > BTRFS: checksum error at logical 61826662400 on dev /dev/dm-6, sector 2541568: metadata leaf (level 0) in tree 5 > BTRFS: bdev /dev/dm-6 errs: wr 12, rd 0, flush 4, corrupt 1, gen 0 > BTRFS: unable to fixup (regular) error at logical 61826662400 on dev /dev/dm-6 > BTRFS: checksum error at logical 61826678784 on dev /dev/dm-6, sector 2541600: metadata leaf (level 0) in tree 5 > BTRFS: checksum error at logical 61826678784 on dev /dev/dm-6, sector 2541600: metadata leaf (level 0) in tree 5 > BTRFS: bdev /dev/dm-6 errs: wr 12, rd 0, flush 4, corrupt 2, gen 0 > BTRFS: unable to fixup (regular) error at logical 61826678784 on dev /dev/dm-6 > BTRFS: checksum error at logical 61826695168 on dev /dev/dm-6, sector 2541632: metadata leaf (level 0) in tree 5 > BTRFS: checksum error at logical 61826695168 on dev /dev/dm-6, sector 2541632: metadata leaf (level 0) in tree 5 > BTRFS: bdev /dev/dm-6 errs: wr 12, rd 0, flush 4, corrupt 3, gen 0 > BTRFS: unable to fixup (regular) error at logical 61826695168 on dev /dev/dm-6 (...) > BTRFS: unable to fixup (regular) error at logical 61827186688 on dev /dev/dm-5 > scrub_handle_errored_block: 632 callbacks suppressed > BTRFS: checksum error at logical 61849731072 on dev /dev/dm-6, sector 2586624: metadata leaf (level 0) in tree 5 > BTRFS: checksum error at logical 61849731072 on dev /dev/dm-6, sector 2586624: metadata leaf (level 0) in tree 5 > btrfs_dev_stat_print_on_error: 632 callbacks suppressed > BTRFS: bdev /dev/dm-6 errs: wr 12, rd 0, flush 4, corrupt 166, gen 0 > scrub_handle_errored_block: 632 callbacks suppressed > BTRFS: unable to fixup (regular) error at logical 61849731072 on dev /dev/dm-6 (...) 
> BTRFS: unable to fixup (regular) error at logical 61864853504 on dev /dev/dm-5 > btree_readpage_end_io_hook: 16 callbacks suppressed > BTRFS: bad tree block start 17393001018657664843 61895917568 > BTRFS: bad tree block start 17393001018657664843 61895917568 > scrub_handle_errored_block: 697 callbacks suppressed > BTRFS: checksum error at logical 61871751168 on dev /dev/dm-3, sector 2629632: metadata leaf (level 0) in tree 5 > BTRFS: checksum error at logical 61871751168 on dev /dev/dm-3, sector 2629632: metadata leaf (level 0) in tree 5 > btrfs_dev_stat_print_on_error: 697 callbacks suppressed > BTRFS: bdev /dev/dm-3 errs: wr 0, rd 0, flush 0, corrupt 236, gen 0 > scrub_handle_errored_block: 697 callbacks suppressed > BTRFS: unable to fixup (regular) error at logical 61871751168 on dev /dev/dm-3 On the plus side, I can remove the last drive I added now that I'm not in degraded mode again: polgara:/mnt/btrfs_backupcopy# btrfs device delete /dev/mapper/crypt_sde1 . > BTRFS info (device dm-10): relocating block group 72689909760 flags 129 > BTRFS info (device dm-10): found 1 extents > BTRFS info (device dm-10): found 1 extents There you go, hope this helps. Marc
On Mar 19, 2014, at 12:09 AM, Marc MERLIN <marc@merlins.org> wrote: > > 7) you can remove a drive from an array, add files, and then if you plug > the drive in, it apparently gets auto sucked in back in the array. > There is no rebuild that happens, you now have an inconsistent array where > one drive is not at the same level as the other ones (I lost all files I added > after the drive was removed when I added the drive back). Seems worthy of a dedicated bug report and keeping an eye on in the future, not good. >> >> polgara:/mnt/btrfs_backupcopy# df -h . >> Filesystem Size Used Avail Use% Mounted on >> /dev/mapper/crypt_sdb1 4.1T 3.0M 4.1T 1% /mnt/btrfs_backupcopy > > Let's add one drive >> polgara:/mnt/btrfs_backupcopy# btrfs device add -f /dev/mapper/crypt_sdm1 /mnt/btrfs_backupcopy/ >> polgara:/mnt/btrfs_backupcopy# df -h . >> Filesystem Size Used Avail Use% Mounted on >> /dev/mapper/crypt_sdb1 4.6T 3.0M 4.6T 1% /mnt/btrfs_backupcopy > > Oh look it's bigger now. We need to manually rebalance to use the new drive: You don't have to. As soon as you add the additional drive, newly allocated chunks will stripe across all available drives. e.g. 1 GB allocations striped across 3x drives, if I add a 4th drive, initially any additional writes are only to the first three drives but once a new data chunk is allocated it gets striped across 4 drives. > > In other words, btrfs happily added my device that was way behind and gave me an incomplete filesystem instead of noticing > that sdj1 was behind and giving me a degraded filesystem. > Moral of the story: do not ever re-add a device that got kicked out if you wrote new data after that, or you will end up with an older version of your filesystem (on the plus side, it's consistent and apparently without data corruption). That said, btrfs scrub complained loudly of many errors it didn't know how to fix. Sure the whole thing isn't corrupt. But if anything written while degraded vanishes once the missing device is reattached, and you remount normally (non-degraded), that's data loss. Yikes! > There you go, hope this helps. Yes. Thanks! Chris Murphy
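For anyone wanting to see that chunk behavior in practice, here is a rough, untested sketch with the btrfs tools; the device name and mount point are placeholders borrowed from the examples above:

# After adding a device, existing chunks stay where they were written;
# only newly allocated chunks can stripe onto the new member.
btrfs device add /dev/mapper/crypt_sdm1 /mnt/btrfs_backupcopy
btrfs fi show                                # the new device shows up with ~0 used
btrfs fi df /mnt/btrfs_backupcopy            # per-profile allocation, unchanged so far
# A full balance is only needed if data written *before* the add should be
# restriped across all members:
btrfs balance start /mnt/btrfs_backupcopy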
On Wed, Mar 19, 2014 at 12:32:55AM -0600, Chris Murphy wrote: > > On Mar 19, 2014, at 12:09 AM, Marc MERLIN <marc@merlins.org> wrote: > > > > 7) you can remove a drive from an array, add files, and then if you plug > > the drive in, it apparently gets auto sucked in back in the array. > > There is no rebuild that happens, you now have an inconsistent array where > > one drive is not at the same level as the other ones (I lost all files I added > > after the drive was removed when I added the drive back). > > Seems worthy of a dedicated bug report and keeping an eye on in the future, not good. Since it's not supposed to be working, I didn't file a bug, but I figured it'd be good for people to know about it in the meantime. > >> polgara:/mnt/btrfs_backupcopy# btrfs device add -f /dev/mapper/crypt_sdm1 /mnt/btrfs_backupcopy/ > >> polgara:/mnt/btrfs_backupcopy# df -h . > >> Filesystem Size Used Avail Use% Mounted on > >> /dev/mapper/crypt_sdb1 4.6T 3.0M 4.6T 1% /mnt/btrfs_backupcopy > > > > Oh look it's bigger now. We need to manually rebalance to use the new drive: > > You don't have to. As soon as you add the additional drive, newly allocated chunks will stripe across all available drives. e.g. 1 GB allocations striped across 3x drives, if I add a 4th drive, initially any additional writes are only to the first three drives but once a new data chunk is allocated it gets striped across 4 drives. That's the thing though. If the bad device hadn't been forcibly removed, and apparently the only way to do this was to unmount, make the device node disappear, and remount in degraded mode, it looked to me like btrfs was still considering that the drive was part of the array and trying to write to it. After adding a drive, I couldn't quite tell if it was striping over 11 drives or 10, but it felt that at least at times, it was striping over 11 drives with write failures on the missing drive. I can't prove it, but I'm thinking the new data I was writing was being striped in degraded mode. > Sure the whole thing isn't corrupt. But if anything written while degraded vanishes once the missing device is reattached, and you remount normally (non-degraded), that's data loss. Yikes! Yes, although it's limited, you apparently only lose new data that was added after you went into degraded mode and only if you add another drive where you write more data. In real life this shouldn't be too common, even if it is indeed a bug. Cheers, Marc
On Mar 19, 2014, at 9:40 AM, Marc MERLIN <marc@merlins.org> wrote: > > After adding a drive, I couldn't quite tell if it was striping over 11 > drives or 10, but it felt that at least at times, it was striping over 11 > drives with write failures on the missing drive. > I can't prove it, but I'm thinking the new data I was writing was being > striped in degraded mode. Well it does sound fragile after all to add a drive to a degraded array, especially when it's not expressly treating the faulty drive as faulty. I think iotop will show what block devices are being written to. And in a VM it's easy (albeit rudimentary) with sparse files, as you can see them grow. > > Yes, although it's limited, you apparently only lose new data that was added > after you went into degraded mode and only if you add another drive where > you write more data. > In real life this shouldn't be too common, even if it is indeed a bug. It's entirely plausible a drive power/data cable becomes loose, runs for hours degraded before the wayward device is reseated. It'll be common enough. It's definitely not OK for all of that data in the interim to vanish just because the volume has resumed from degraded to normal. Two states of data, normal vs degraded, is scary. It sounds like totally silent data loss. So yeah if it's reproducible it's worthy of a separate bug. Chris Murphy
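A quick way to watch which member devices actually receive writes, along the lines of the iotop suggestion above (a sketch; iostat comes from the sysstat package, and the dm-* names are only examples):

iostat -d -x 2                        # per-device throughput every 2s; watch the dm-* rows
iotop -o                              # per-process view: only tasks currently doing I/O
grep -E 'dm-0|dm-6' /proc/diskstats   # raw per-device counters, compare before/after a test write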
On Wed, Mar 19, 2014 at 10:53:33AM -0600, Chris Murphy wrote: > > Yes, although it's limited, you apparently only lose new data that was added > > after you went into degraded mode and only if you add another drive where > > you write more data. > > In real life this shouldn't be too common, even if it is indeed a bug. > > It's entirely plausible a drive power/data cable becomes loose, runs for hours degraded before the wayward device is reseated. It'll be common enough. It's definitely not OK for all of that data in the interim to vanish just because the volume has resumed from degraded to normal. Two states of data, normal vs degraded, is scary. It sounds like totally silent data loss. So yeah if it's reproducible it's worthy of a separate bug. Actually what I did is more complex: I first added a drive to a degraded array, and then re-added the drive that had been removed. I don't know if re-adding the same drive that was removed would cause the bug I saw. For now, my array is back to actually trying to store the backup I had meant for it, and the drives seem stable now that I fixed the power issue. Does someone else want to try? :) Marc
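For anyone who does want to try, here is a rough, untested sketch of the stale-member re-add scenario using loop devices instead of real disks; all paths, sizes and loop numbers below are made up:

mkdir -p /mnt/test
for i in 1 2 3; do truncate -s 3G /tmp/disk$i.img; done
for i in 1 2 3; do losetup -f --show /tmp/disk$i.img; done   # assume loop0, loop1, loop2
mkfs.btrfs -f -d raid5 -m raid5 /dev/loop0 /dev/loop1 /dev/loop2
mount /dev/loop0 /mnt/test
dd if=/dev/urandom of=/mnt/test/before.bin bs=1M count=100; sync
umount /mnt/test
losetup -d /dev/loop2                      # simulate one member disappearing
btrfs device scan
mount -o degraded /dev/loop0 /mnt/test
dd if=/dev/urandom of=/mnt/test/while-degraded.bin bs=1M count=100; sync
umount /mnt/test
losetup /dev/loop2 /tmp/disk3.img          # re-attach the stale member
btrfs device scan
mount /dev/loop0 /mnt/test
ls -l /mnt/test                            # does while-degraded.bin still exist?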
On Thu, Mar 20, 2014 at 01:44:20AM +0100, Tobias Holst wrote: > I tried the RAID6 implementation of btrfs and it looks like I had the > same problem. Rebuild with "balance" worked but when a drive was > removed when mounted and then readded, the chaos began. I tried it a > few times. So when a drive fails (and this is just because of > connection lost or similar non-severe problems), then it is necessary > to wipe the disc first before readding it, so btrfs will add it as a > new disk and not try to readd the old one. Good to know you got this too. Just to confirm: did you get it to rebuild, or once a drive is lost/gets behind, you're in degraded mode forever for those blocks? Or were you able to balance? Marc
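The wipe-before-re-add suggestion quoted above would look roughly like this (a sketch only; the device name and mount point are placeholders):

wipefs -a /dev/mapper/crypt_sdj1               # clear the old btrfs superblock signature
btrfs device add /dev/mapper/crypt_sdj1 /mnt/btrfs_backupcopy   # comes back as a new, empty member
btrfs balance start /mnt/btrfs_backupcopy      # restripe existing data onto it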
Marc MERLIN posted on Wed, 19 Mar 2014 08:40:31 -0700 as excerpted: > That's the thing though. If the bad device hadn't been forcibly removed, > and apparently the only way to do this was to unmount, make the device > node disappear, and remount in degraded mode, it looked to me like btrfs > was still considering that the drive was part of the array and trying to > write to it. > After adding a drive, I couldn't quite tell if it was striping over 11 > drives or 10, but it felt that at least at times, it was striping over > 11 drives with write failures on the missing drive. > I can't prove it, but I'm thinking the new data I was writing was being > striped in degraded mode. FWIW, there are at least two problems here, one a bug (or perhaps it'd more accurately be described as an as yet incomplete feature) unrelated to btrfs raid5/6 mode, the other the incomplete raid5/6 support. Both are known issues, however. The incomplete raid5/6 is discussed well enough elsewhere including in this thread as a whole, which leaves the other issue. The other issue, not specifically raid5/6 mode related, is that currently, in-kernel btrfs is basically oblivious to disappearing drives, thus explaining some of the more complex bits of the behavior you described. Yes, the kernel has the device data and other layers know when a device goes missing, but it's basically a case of the right hand not knowing what the left hand is doing -- once set up on a set of devices, in-kernel btrfs basically doesn't do anything with the device information available to it, at least in terms of removing a device from its listing when it goes missing. (It does seem to transparently handle a missing btrfs component device reappearing, arguably /too/ transparently!) Basically all btrfs does is log errors when a component device disappears. It doesn't do anything with the disappeared device, and really doesn't "know" it has disappeared at all, until an unmount and (possibly degraded) remount, at which point it re-enumerates the devices and again knows what's actually there... until a device disappears again. There are actually patches being worked on to fix that situation as we speak, and it's possible they're actually in btrfs-next already. (I've seen the patches and discussion go by on the list but haven't tracked them to the extent that I know current status, other than that they're not in mainline yet.) Meanwhile, counter-intuitively, btrfs-userspace is sometimes more aware of current device status than btrfs-kernel is ATM, since parts of userspace actually either get current status from the kernel, or trigger a rescan in order to get it. But even after a rescan updates what userspace knows and thus what the kernel as a whole knows, btrfs-kernel still doesn't actually use that new information available to it in the same kernel that btrfs-userspace used to get it from! Knowing that rather counterintuitive "little" inconsistency, that isn't actually so little, goes quite a way toward explaining what otherwise looks like illogical btrfs behavior -- how could kernel-btrfs not know the status of its own devices?
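The userspace side mentioned above is visible with the regular tools (a sketch; the mount point is a placeholder):

btrfs device scan                          # ask the kernel to re-enumerate btrfs member devices
btrfs fi show                              # what the scan found, including "Some devices missing"
btrfs device stats /mnt/btrfs_backupcopy   # per-device error counters tracked by the kernel

A rescan updates what gets reported, but as described above it does not change how the already-mounted filesystem treats a member that went away.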
I think after the balance it was a fine, non-degraded RAID again... As far as I remember. Tobby 2014-03-20 1:46 GMT+01:00 Marc MERLIN <marc@merlins.org>: > > On Thu, Mar 20, 2014 at 01:44:20AM +0100, Tobias Holst wrote: > > I tried the RAID6 implementation of btrfs and it looks like I had the > > same problem. Rebuild with "balance" worked but when a drive was > > removed when mounted and then readded, the chaos began. I tried it a > > few times. So when a drive fails (and this is just because of > > connection lost or similar non-severe problems), then it is necessary > > to wipe the disc first before readding it, so btrfs will add it as a > > new disk and not try to readd the old one. > > Good to know you got this too. > > Just to confirm: did you get it to rebuild, or once a drive is lost/gets > behind, you're in degraded mode forever for those blocks? > > Or were you able to balance? > > Marc > -- > "A mouse is a device used to point at the xterm you want to type in" - A.S.R. > Microsoft is to operating systems .... > .... what McDonalds is to gourmet cooking > Home page: http://marc.merlins.org/
On Wed, Mar 19, 2014 at 10:53:33AM -0600, Chris Murphy wrote: > > On Mar 19, 2014, at 9:40 AM, Marc MERLIN <marc@merlins.org> wrote: > > > > After adding a drive, I couldn't quite tell if it was striping over 11 > > drives or 10, but it felt that at least at times, it was striping over 11 > > drives with write failures on the missing drive. > > I can't prove it, but I'm thinking the new data I was writing was being > > striped in degraded mode. > > Well it does sound fragile after all to add a drive to a degraded array, especially when it's not expressly treating the faulty drive as faulty. I think iotop will show what block devices are being written to. And in a VM it's easy (albeit rudimentary) with sparse files, as you can see them grow. > > > > > Yes, although it's limited, you apparently only lose new data that was added > > after you went into degraded mode and only if you add another drive where > > you write more data. > > In real life this shouldn't be too common, even if it is indeed a bug. > > It's entirely plausible a drive power/data cable becomes loose, runs for hours degraded before the wayward device is reseated. It'll be common enough. It's definitely not OK for all of that data in the interim to vanish just because the volume has resumed from degraded to normal. Two states of data, normal vs degraded, is scary. It sounds like totally silent data loss. So yeah if it's reproducible it's worthy of a separate bug. I just got around to filing that bug: https://bugzilla.kernel.org/show_bug.cgi?id=72811 In other news, I was able to 1) remove a drive 2) mount degraded 3) add a new drive 4) rebalance (that took 2 days with little data, 4 deadlocks and reboots though) 5) remove the missing drive from the filesystem 6) remount the array without -o degraded Now, I'm testing 1) add a new drive 2) remove a working drive 3) the rebalance triggered by #2 should rebuild onto the new drive automatically Marc
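For reference, a rough command-level version of that first six-step sequence (an untested sketch; device names are placeholders and the replacement disk is assumed to be blank):

umount /mnt/btrfs_backupcopy
mount -o degraded /dev/mapper/crypt_sdb1 /mnt/btrfs_backupcopy   # 1+2: failed disk gone, mount degraded
btrfs device add /dev/mapper/crypt_new /mnt/btrfs_backupcopy     # 3: add the replacement
btrfs balance start /mnt/btrfs_backupcopy                        # 4: rebuild/restripe onto it
btrfs device delete missing /mnt/btrfs_backupcopy                # 5: drop the absent member
umount /mnt/btrfs_backupcopy
mount /dev/mapper/crypt_sdb1 /mnt/btrfs_backupcopy               # 6: normal, non-degraded mount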
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c index 6463691..d869079 100644 --- a/fs/btrfs/send.c +++ b/fs/btrfs/send.c @@ -3184,12 +3184,12 @@ static int wait_for_parent_move(struct send_ctx *sctx, struct fs_path *path_after = NULL; int len1, len2; - if (parent_ref->dir <= sctx->cur_ino) - return 0; - if (is_waiting_for_move(sctx, ino)) return 1; + if (parent_ref->dir <= sctx->cur_ino) + return 0; + ret = get_inode_info(sctx->parent_root, ino, NULL, &old_gen, NULL, NULL, NULL, NULL); if (ret == -ENOENT)
It's possible to change the parent/child relationship between directories in such a way that if a child directory has a higher inode number than its parent, it doesn't necessarily mean the child rename/move operation can be performed immediately. The parent might have its own rename/move operation delayed, therefore in this case the child needs to have its rename/move operation delayed too, and be performed after its new parent's rename/move. Steps to reproduce the issue: $ umount /mnt $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ mkdir /mnt/A $ mkdir /mnt/B $ mkdir /mnt/C $ mv /mnt/C /mnt/A $ mv /mnt/B /mnt/A/C $ mkdir /mnt/A/C/D $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send /mnt/snap1 -f /tmp/base.send $ mv /mnt/A/C/D /mnt/A/D2 $ mv /mnt/A/C/B /mnt/A/D2/B2 $ mv /mnt/A/C /mnt/A/D2/B2/C2 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send The incremental send caused the kernel code to enter an infinite loop when building the path string for directory C after its references are processed. The necessary conditions here are that C has an inode number higher than both A and B, B has a higher inode number than A, and D has the highest inode number, that is: inode_number(A) < inode_number(B) < inode_number(C) < inode_number(D) The same issue could happen if after the first snapshot there's any number of intermediary parent directories between A2 and B2, and between B2 and C2. A test case for xfstests follows, covering this simple case and more advanced ones, with files and hard links created inside the directories. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> --- V2: Right version of the patch. The previously sent one came from the wrong vm. V3: The condition needed to check already existed, so just moved it to the top, instead of adding it again. fs/btrfs/send.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)