Message ID: 15fc8f8d002e4ffcdb46e769736f240ae7ace20b.1442839332.git.zhaolei@cn.fujitsu.com (mailing list archive)
State: New, archived
On Mon, Sep 21, 2015 at 1:59 PM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:
> btrfs in v4.3-rc1 failed many xfstests items with the '-o nospace_cache'
> mount option.
>
> Failed cases are:
> btrfs/008,016,019,020,026,027,028,029,031,041,046,048,050,051,053,054,
> 077,083,084,087,092,094

Hi Zhao,

So far I tried a few of those against Chris' integration-4.3 branch
(same btrfs code as 4.3-rc1):

MOUNT_OPTIONS="-o nospace_cache" ./check btrfs/008 btrfs/016 btrfs/019 btrfs/020
FSTYP         -- btrfs
PLATFORM      -- Linux/x86_64 debian3 4.2.0-rc5-btrfs-next-12+
MKFS_OPTIONS  -- /dev/sdc
MOUNT_OPTIONS -- -o nospace_cache /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

btrfs/008        2s ... 1s
btrfs/016        4s ... 3s
btrfs/019        4s ... 2s
btrfs/020        2s ... 1s
Ran: btrfs/008 btrfs/016 btrfs/019 btrfs/020
Passed all 4 tests

And none of the tests failed...

> generic/004,010,014,023,024,074,075,080,086,087,089,091,092,100,112,123,
> 124,125,126,127,131,133,192,193,198,207,208,209,213,214,215,228,239,240,
> 246,247,248,255,263,285,306,313,316,323
>
> We can reproduce this bug with the following simple commands:
> TEST_DEV=/dev/vdh
> TEST_DIR=/mnt/tmp
>
> umount "$TEST_DEV" >/dev/null
> mkfs.btrfs -f "$TEST_DEV"
> mount "$TEST_DEV" "$TEST_DIR"
>
> umount "$TEST_DEV"
> mount "$TEST_DEV" "$TEST_DIR"
>
> cp /bin/bash $TEST_DIR
>
> Result is:
> (omitting previous commands)
> # cp /bin/bash $TEST_DIR
> cp: writing `/mnt/tmp/bash': No space left on device
>
> By bisecting, we can see it is triggered by:
> commit e44163e17796
> ("btrfs: explictly delete unused block groups in close_ctree and ro-remount")
>
> But the buggy code is not in the above patch: btrfs deletes all chunks
> if there is no data in the filesystem, and the above patch just makes
> it obvious.
>
> Detailed reason:
> 1: mkfs a blank filesystem, or delete everything in the filesystem
> 2: umount the fs
>    (current code will delete all data chunks)
> 3: mount the fs
>    Because there are no data chunks, the data space_info has no chance
>    to be initialized; it means: space_info->total_bytes == 0, and
>    space_info->full == 1.

Right, and that's the problem. When the space_info is initialized it
should never be flagged as full, otherwise any buffered write attempts
fail immediately with enospc instead of trying to allocate a data
block group (at extent-tree.c:btrfs_check_data_free_space()).

That was fixed recently by:

https://patchwork.kernel.org/patch/7133451/

(with a respective test too, https://patchwork.kernel.org/patch/7133471/)

> 4: do some write
>    Current code will skip chunk allocation because of space_info->full,
>    and return -ENOSPC.
>
> Fix:
> Don't auto-delete the last block group for a raid type.
> Deleting all block groups of a raid type not only causes the above bug,
> but may also change the filesystem to all-single in some cases.

I don't get this. Can you mention in which cases that happens and why
(in the commit message)?

It isn't clear when reading the patch why we need to keep at least one
block group of each type/profile, and it seems to be a workaround for a
different problem.

thanks

>
> Test:
> Tested by the above script, and confirmed the logic via debug output.
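To make the failure path concrete, here is a condensed user-space paraphrase of the check Filipe refers to in extent-tree.c:btrfs_check_data_free_space(). The struct and function names below are stand-ins (assumptions), and only the control flow is meant to mirror the 4.3-era kernel code:

```c
#include <errno.h>

/* Minimal stand-ins for the kernel structures involved; a sketch,
 * not kernel code. */
struct space_info_sketch {
	unsigned long long total_bytes;	/* 0 when no data block group exists */
	unsigned long long bytes_used;
	int full;			/* the bug: wrongly set to 1 at init */
};

static int check_data_free_space_sketch(struct space_info_sketch *data_sinfo,
					unsigned long long bytes,
					int (*alloc_chunk)(struct space_info_sketch *))
{
	if (data_sinfo->bytes_used + bytes > data_sinfo->total_bytes) {
		/* With total_bytes == 0 and full == 1, every write lands
		 * here and fails without ever allocating a data chunk. */
		if (data_sinfo->full)
			return -ENOSPC;
		/* The intended path: grow the space by allocating a new
		 * data block group, then retry the accounting. */
		return alloc_chunk(data_sinfo);
	}
	return 0;
}
```

This is why the fix Filipe links to (never flagging a freshly created space_info as full) makes the write fall through to chunk allocation instead of returning -ENOSPC.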
>
> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
> ---
>  fs/btrfs/extent-tree.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 5411f0a..35cf7eb 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -10012,7 +10012,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>  					       bg_list);
>  		space_info = block_group->space_info;
>  		list_del_init(&block_group->bg_list);
> -		if (ret || btrfs_mixed_space_info(space_info)) {
> +		if (ret || btrfs_mixed_space_info(space_info) ||
> +		    block_group->list.next == block_group->list.prev) {
>  			btrfs_put_block_group(block_group);
>  			continue;
>  		}
> --
> 1.8.5.1
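A side note on the one-line change above: on the kernel's circular doubly linked lists, a linked node whose next and prev pointers are equal must have the list head as both neighbours, so the new condition is a hand-rolled test for "this block group is the only entry on its per-profile list" (the same thing list_is_singular() reports for the list head). A self-contained user-space sketch of that invariant, with made-up helper names:

```c
#include <stdio.h>

/* Minimal circular doubly linked list in the style of the kernel's
 * struct list_head (user-space sketch, not kernel code). */
struct list_node {
	struct list_node *next, *prev;
};

static void list_init(struct list_node *head)
{
	head->next = head->prev = head;
}

static void list_add_tail(struct list_node *node, struct list_node *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* The patch's test: for a linked member node, next == prev only when
 * both neighbours are the list head, i.e. it is the sole entry. */
static int is_sole_entry(const struct list_node *node)
{
	return node->next == node->prev;
}

int main(void)
{
	struct list_node head, a, b;

	list_init(&head);
	list_add_tail(&a, &head);
	printf("one entry:   %d\n", is_sole_entry(&a));	/* prints 1 */
	list_add_tail(&b, &head);
	printf("two entries: %d\n", is_sole_entry(&a));	/* prints 0 */
	return 0;
}
```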
On Mon, Sep 21, 2015 at 2:27 PM, Filipe David Manana <fdmanana@gmail.com> wrote:
> On Mon, Sep 21, 2015 at 1:59 PM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:

[snip]

>> Fix:
>> Don't auto-delete the last block group for a raid type.
>> Deleting all block groups of a raid type not only causes the above bug,
>> but may also change the filesystem to all-single in some cases.
>
> I don't get this. Can you mention in which cases that happens and why
> (in the commit message)?
>
> It isn't clear when reading the patch why we need to keep at least one
> block group of each type/profile, and it seems to be a workaround for a
> different problem.

Plus it would be a bad fix for such a problem, as anyone can still
trigger deletion of the last block group via a balance operation (like
in the test at https://patchwork.kernel.org/patch/7133471/), i.e.,
preventing deletion by the cleaner kthread is not enough to guarantee
the last block group of a kind isn't deleted...
[snip]

--
Filipe David Manana,

"Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men."
Hi, Filipe David Manana

Thanks for reviewing this patch.

> -----Original Message-----
> From: Filipe David Manana [mailto:fdmanana@gmail.com]
> Sent: Monday, September 21, 2015 9:27 PM
> To: Zhao Lei <zhaolei@cn.fujitsu.com>
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: [PATCH] btrfs: Fix no space bug caused by removing bg

[snip]

> Ran: btrfs/008 btrfs/016 btrfs/019 btrfs/020 Passed all 4 tests
>
> And none of the tests failed...

Sorry, I hadn't pasted the details of my test commands.

I hit this by accident, with steps that differ somewhat from the
standard ones (yours): I mounted the fs with -o nospace_cache manually,
without setting MOUNT_OPTIONS, and then xfstests entered a special path
and triggered the bug:

export TEST_DEV='/dev/sdb5'
export TEST_DIR='/var/ltf/tester/mnt'
mkdir -p '/var/ltf/tester/mnt'

export SCRATCH_DEV_POOL='/dev/sdb6 /dev/sdb7 /dev/sdb8 /dev/sdb9 /dev/sdb10 /dev/sdb11'
export SCRATCH_MNT='/var/ltf/tester/scratch_mnt'
mkdir -p '/var/ltf/tester/scratch_mnt'

export DIFF_LENGTH=0

mkfs.btrfs -f "$TEST_DEV"
mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"

./check generic/014

Result:
FSTYP         -- btrfs
PLATFORM      -- Linux/x86_64 lenovo 4.3.0-rc2_HEAD_1f93e4a96c9109378204c147b3eec0d0e8100fde_
MKFS_OPTIONS  -- /dev/sdb6
MOUNT_OPTIONS -- /dev/sdb6 /var/ltf/tester/scratch_mnt

generic/014 0s ... - output mismatch (see /var/lib/xfstests/results//generic/014.out.bad)
    --- tests/generic/014.out	2015-09-22 17:46:13.855391451 +0800
    +++ /var/lib/xfstests/results//generic/014.out.bad	2015-09-22 17:57:06.446095748 +0800
    @@ -3,4 +3,5 @@
     ------
     test 1
     ------
    -OK
    +truncfile returned 1 : "write: No space left on device
    +Seed = 1442915826 (use "-s 1442915826" to re-execute this test)"
Ran: generic/014
Failures: generic/014
Failed 1 of 1 tests

And the simple reproduction script in the patch description came from
tracing this test. Maybe I should remove the xfstests description from
the changelog, since these are not the standard steps.
[snip]

> Right, and that's the problem. When the space_info is initialized it
> should never be flagged as full, otherwise any buffered write attempts
> fail immediately with enospc instead of trying to allocate a data
> block group (at extent-tree.c:btrfs_check_data_free_space()).
>
> That was fixed recently by:
>
> https://patchwork.kernel.org/patch/7133451/
>
> (with a respective test too, https://patchwork.kernel.org/patch/7133471/)

It can fix the problem at mount time, but it cannot fix the
"raid-level change" problem; please see below.

[snip]

> I don't get this. Can you mention in which cases that happens and why
> (in the commit message)?
>
> It isn't clear when reading the patch why we need to keep at least one
> block group of each type/profile, and it seems to be a workaround for a
> different problem.

Simply speaking, if we run the following commands after applying your patch:

TEST_DEV=(/dev/vdg /dev/vdh)
TEST_DIR=/mnt/tmp

umount "$TEST_DEV" >/dev/null
mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"

umount "$TEST_DEV"
mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"

btrfs filesystem usage $TEST_DIR

The result is:
# btrfs filesystem usage $TEST_DIR
(omitted)
Data,single: Size:8.00MiB, Used:0.00B
   /dev/vdg        8.00MiB
...

We can see the data chunks changed from raid1 to single, because if we
delete all data chunks before mount, there is no raid-type information
left in the filesystem, and btrfs will use a raid type of "0x0" (single)
for new data chunks even with your patch applied.

So leaving at least one data chunk is a simple workaround for both of
the above bugs.

Thanks
Zhaolei

[snip]
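As an aside, if a filesystem has already been downgraded this way, the profile can be brought back without reformatting by converting the data chunks with balance. A sketch, assuming the filesystem is still mounted at $TEST_DIR as in the script above (the optional 'soft' modifier skips chunks already in the target profile):

```sh
# Rewrite the data chunks back to raid1; 'soft' leaves already-raid1
# chunks untouched, so only the stray 'single' chunk is converted.
btrfs balance start -dconvert=raid1,soft "$TEST_DIR"
btrfs filesystem usage "$TEST_DIR"   # Data should be reported as RAID1 again
```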
On Tue, Sep 22, 2015 at 11:06 AM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:

[snip]
> We can see the data chunks changed from raid1 to single, because if we
> delete all data chunks before mount, there is no raid-type information
> left in the filesystem, and btrfs will use a raid type of "0x0" (single)
> for new data chunks even with your patch applied.
>
> So leaving at least one data chunk is a simple workaround for both of
> the above bugs.
Well, it's a hack that, besides not being very logical, is also
incomplete and unreliable: preventing the cleaner kthread (automatic
removal of unused block groups) from deleting the last block group of a
kind doesn't prevent the block group from being deleted by a balance,
just as the new regression xfstest does...

In other words, the problem existed before we had automatic removal of
unused block groups; it was just less likely for someone to run into it.

So please let's have a real fix for this problem. It seems like we're
losing the raid profile type somewhere when allocating a new block group.

thanks

[snip]

--
Filipe David Manana,

"Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men."
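For reference, a sketch of the balance-based trigger Filipe describes, in the same style as the earlier reproduction scripts (device paths are placeholders; the behaviour comments restate what the thread claims, not independently verified output):

```sh
TEST_DEV=(/dev/vdg /dev/vdh)
TEST_DIR=/mnt/tmp

mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
mount "$TEST_DEV" "$TEST_DIR"

# Nothing written yet, so the only raid1 data block group is empty;
# the -dusage=0 filter makes balance pick up (and delete) exactly
# such empty data block groups, bypassing the cleaner kthread.
btrfs balance start -dusage=0 "$TEST_DIR"

umount "$TEST_DEV"
mount "$TEST_DEV" "$TEST_DIR"

# Per the discussion above, the next data allocation now either fails
# with ENOSPC (before the space_info fix) or creates a 'single' block
# group, even though the cleaner kthread never deleted anything.
cp /bin/bash "$TEST_DIR"
btrfs filesystem usage "$TEST_DIR"
```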
Hi, Filipe David Manana

> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org
> [mailto:linux-btrfs-owner@vger.kernel.org] On Behalf Of Zhao Lei
> Sent: Tuesday, September 22, 2015 6:06 PM
> To: fdmanana@gmail.com
> Cc: linux-btrfs@vger.kernel.org
> Subject: RE: [PATCH] btrfs: Fix no space bug caused by removing bg

[snip]
Plus, on retesting, the xfstests failure also happens with the standard
steps on my node (with the newest xfstests):

# btrfs --version
btrfs-progs v4.2
# uname -a
Linux lenovo 4.3.0-rc2_HEAD_1f93e4a96c9109378204c147b3eec0d0e8100fde_ #1 SMP Mon Sep 21 06:34:49 CST 2015 x86_64 x86_64 x86_64 GNU/Linux
# MOUNT_OPTIONS="-o nospace_cache" ./check btrfs/008
FSTYP         -- btrfs
PLATFORM      -- Linux/x86_64 lenovo 4.3.0-rc2_HEAD_1f93e4a96c9109378204c147b3eec0d0e8100fde_
MKFS_OPTIONS  -- /dev/sdb6
MOUNT_OPTIONS -- -o nospace_cache /dev/sdb6 /var/ltf/tester/scratch_mnt

btrfs/008 1s ... [failed, exit status 1] - output mismatch (see /var/lib/xfstests/results//btrfs/008.out.bad)
    --- tests/btrfs/008.out	2015-09-22 17:46:12.530391386 +0800
    +++ /var/lib/xfstests/results//btrfs/008.out.bad	2015-09-22 18:17:24.699154708 +0800
    @@ -1,2 +1,3 @@
     QA output created by 008
    -Silence is golden
    +send failed
    +(see /var/lib/xfstests/results//btrfs/008.full for details)
Ran: btrfs/008
Failures: btrfs/008
Failed 1 of 1 tests

Maybe there is some difference between our nodes, but I think it has no
relationship with this bug, and we need not investigate the detailed
reason.

Thanks
Zhaolei

[snip]
Hi, Filipe David Manana

> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org
> [mailto:linux-btrfs-owner@vger.kernel.org] On Behalf Of Filipe David Manana
> Sent: Tuesday, September 22, 2015 6:22 PM
> To: Zhao Lei <zhaolei@cn.fujitsu.com>
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: [PATCH] btrfs: Fix no space bug caused by removing bg

[snip]
> Well, it's a hack that, besides not being very logical, is also
> incomplete and unreliable: preventing the cleaner kthread (automatic
> removal of unused block groups) from deleting the last block group of a
> kind doesn't prevent the block group from being deleted by a balance,
> just as the new regression xfstest does...

What this patch fixes is the regression caused by the
"auto-remove-bg-at-umount" patch.

And the balance problem is another bug, which should be fixed by
another patch.

> In other words, the problem existed before we had automatic removal of
> unused block groups; it was just less likely for someone to run into it.
>
> So please let's have a real fix for this problem. It seems like we're
> losing the raid profile type somewhere when allocating a new block group.

I think the way of "keeping the last bg" is not a hack; it is just like
creating one blank data chunk in mkfs.

In detail, maybe "writing the raid type to disk" is the more direct way,
but it would change our on-disk format (e.g. the superblock), and would
need changes in lots of code: chunk allocation, balance, mkfs, check,
and more, plus compatibility with old filesystems.

So the way in this patch is maybe the best fit for the above bugs.

Thanks
Zhaolei

[snip]
On Tue, Sep 22, 2015 at 12:24 PM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:

[snip]
> What this patch fixes is the regression caused by the
> "auto-remove-bg-at-umount" patch.

Well no, that patch did not introduce any problem so far... It only
exposes, more easily, a problem that has existed for a long time...

> And the balance problem is another bug, which should be fixed by
> another patch.
>
> I think the way of "keeping the last bg" is not a hack; it is just like
> creating one blank data chunk in mkfs.
>
> In detail, maybe "writing the raid type to disk" is the more direct way,
> but it would change our on-disk format (e.g. the superblock), and would
> need changes in lots of code: chunk allocation, balance, mkfs, check,
> and more, plus compatibility with old filesystems.
>
> So the way in this patch is maybe the best fit for the above bugs.

At the very least this patch should have its title and description
changed, to reflect that it attempts to solve only (and partially) the
problem of losing the raid profile type. The enospc issue is not caused
by not having any data block group allocated; it is instead due to a
recent regression when initializing a space_info object as full, as
described and proved in the patch mentioned before [1] - you could run
into the enospc issue even without having the automatic removal of
empty block groups or the patch [2] introduced in 4.3 (just try the new
xfstest [3] to verify it yourself).

[1] https://patchwork.kernel.org/patch/7133451/
[2] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=e44163e177960ee60e32a73bffdd53c3a5827406
[3] https://git.kernel.org/cgit/fs/xfs/xfstests-dev.git/commit/?id=da82741228c7efaedae2e027efc54b1892800fe1

[snip]
On 9/21/15 8:59 AM, Zhao Lei wrote:

[snip]

> Fix: Don't auto-delete the last block group for a raid type.

The only reason not to do this is the loss of the raid type, which is
an issue that needs to be addressed separately. As Filipe said in his
responses, this isn't the source of the problem - it just makes it
obvious. We've seen in actual bug reports that it's possible to create
exactly the same scenario by balancing an empty file system.

So if the way we want to prevent the loss of raid type info is by
maintaining the last block group allocated with that raid type, fine,
but that's a separate discussion. Personally, I think keeping 1GB
allocated as a placeholder is a bit much. Beyond that, I've been
thinking casually about ways to direct the allocator to use certain
devices for certain things (e.g. in a hybrid system with SSDs and HDDs,
always allocate metadata on the SSD) and there's some overlap there. As
it stands, we can fake that in mkfs but it'll get stomped by balance
nearly immediately.

-Jeff

> If we delete all block groups of a raid type, it not only causes the
> above bug, but may also change the filesystem to all-single in some
> cases.
>
> Test: Tested by the above script, and confirmed the logic via debug
> output.
> > Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> --- > fs/btrfs/extent-tree.c | 3 ++- 1 file changed, 2 insertions(+), 1 > deletion(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index > 5411f0a..35cf7eb 100644 --- a/fs/btrfs/extent-tree.c +++ > b/fs/btrfs/extent-tree.c @@ -10012,7 +10012,8 @@ void > btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info) bg_list); > space_info = block_group->space_info; > list_del_init(&block_group->bg_list); - if (ret || > btrfs_mixed_space_info(space_info)) { + if (ret || > btrfs_mixed_space_info(space_info) || + block_group->list.next > == block_group->list.prev) { btrfs_put_block_group(block_group); > continue; } > - -- Jeff Mahoney SUSE Labs -----BEGIN PGP SIGNATURE----- Version: GnuPG/MacGPG2 v2.0.19 (Darwin) iQIcBAEBAgAGBQJWAVDNAAoJEB57S2MheeWyCwoQAId9IK0vYX01W20SeLt5E5ql cabIeN3JCLcmtEbJzhNxQtcjvB7Rgq/r3BRDV0n0Z71dyv8WV8vau4Qka8xUVtLL l+sbuRIEBUR3UHOvqjV7MxSZeZrQZLWeGuCRH9El059hDn/JFsF9n3wJx8YsgXKe dma2RG6MHFVXY08jYkLc6nexBbYlc3Dj2jbd2Jr7gHy4YwFTCM9YncR+STV2K47Q N/pfRwiVHFHHVTju5lg3wzp+xvFPeU52cfWHL05axe8l75pU6Ywwrk406QxyrTvx 2Rh8tXBJItUeMA/D8mRnwWVZBWFUndl6JlBNSyf51fSP+4lPkChbM5UnSOjDOwvE E7XpGy31TQI0bqpy8qoIkI9wkek6iOlMCppZ9U2vICbeP+65WtNZKfQcCO0t6Z1H 6IqfHsaDvvaiorxEWWIarsIfHZWnWJeav545t6pd4VU3v52YQN2YIOLY8EhWv4Wt 90Xc1izPvPvnyQa3eQPg1ISdqNfJRFlYjSJ75zGvSPurIy77oOyvPa1EfEO7IMys zXyjgKzU6Yox1iXxeJsDxuAa+FX9P2rXqd8WYP2mBRqH2BE6D+R8V/NitGmXSkYA bBXN1H/m+gP5qhHLnBQZU+ABH1dDp6RJ1BCsg7iDJBmfE+hJI8YIwowwH/C0RBST 1HgsAUWHmDsjHcYr3/ZB =Li+/ -----END PGP SIGNATURE----- -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
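[Editor's note: for readers tracing the hunk under discussion, the
added condition is an open-coded test that this block group is the
only entry on its space_info list for its raid profile: for a node on
a circular kernel list, next == prev holds only when the node is the
sole entry. A sketch of an equivalent form, assuming the 4.3-era
helper get_block_group_index() and the stock list_is_singular() from
<linux/list.h>; this is an illustration, not code from the thread:

    	/*
    	 * Skip deletion when this is the last block group of its
    	 * raid profile, testing the list head instead of the node.
    	 */
    	int index = get_block_group_index(block_group);

    	if (ret || btrfs_mixed_space_info(space_info) ||
    	    list_is_singular(&space_info->block_groups[index])) {
    		btrfs_put_block_group(block_group);
    		continue;
    	}
]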
On Tue, Sep 22, 2015 at 08:59:57AM -0400, Jeff Mahoney wrote:
[snip]
> So if the way we want to prevent the loss of raid type info is by
> maintaining the last block group allocated with that raid type, fine,
> but that's a separate discussion. Personally, I think keeping 1GB
> allocated as a placeholder is a bit much. Beyond that, I've been
> thinking casually about ways to direct the allocator to use certain
> devices for certain things (e.g. in a hybrid system with SSDs and
> HDDs, always allocate metadata on the SSD) and there's some overlap
> there. As it stands, we can fake that in mkfs but it'll get stomped
> by balance nearly immediately.

In terms of selecting the location of chunks within the allocator, I
wrote up a design for a pretty generic way of doing it some time ago
[1]. It would allow things like metadata on SSDs, but also defining
failure domains for replication (i.e. "I want two copies of my data
in RAID-1, but I want each copy to go on a different storage array").
It would also give us the ability to handle different allocation
strategies, such as filling up one device at a time.

I got as far as some python to demonstrate the algorithms and
structure (also in that mail thread). I started trying to work out
how to rewrite the allocator in the kernel to support it, but I got
lost in the code fairly rapidly, particularly about how to store the
relevant policy metadata (for the FS as a whole, and, later, on a
per-subvolume basis).

Hugo.

[1] http://www.mail-archive.com/linux-btrfs%40vger.kernel.org/msg33499.html
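[Editor's note: a rough illustration of what such policy metadata
could look like. All names here are invented for illustration and are
sketched from the ideas in Hugo's mail, not taken from his actual
proposal in [1]:

    typedef unsigned long long u64;
    typedef unsigned int u32;

    /* one hypothetical allocation-policy record */
    struct chunk_alloc_policy {
    	u64 profile;        /* replication level, e.g. a RAID1-like mask */
    	u64 allowed_devs;   /* bitmask restricting which devices qualify */
    	u32 failure_domain; /* copies must land in distinct domains */
    	u32 strategy;       /* spread across devices vs. fill one first */
    };
]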
On 09/22/15 14:59, Jeff Mahoney wrote:
(snip)
> So if the way we want to prevent the loss of raid type info is by
> maintaining the last block group allocated with that raid type, fine,
> but that's a separate discussion. Personally, I think keeping 1GB

At this point I'm much more surprised to learn that the RAID type can
apparently get "lost" in the first place, and is not persisted
separately. I mean..wat?

> allocated as a placeholder is a bit much. Beyond that, I've been

Can you explain why keeping at least one data chunk (or the
appropriate number across devices, depending on RAID level..) is "a
bit much"? IMHO this would have no real negative impact on end users
(who think in terms of overall filesystem space, not how much of that
has been lazily touched), nor for more obscure use cases like sparse
images - which would still be sparsely reserved. So there would not
be any actual downside that I can see. From a fs consistency point of
view it sounds much more sane to assume that at least a basic set of
data chunks always needs to exist. I know I was very surprised
recently to see all my data chunks cleaned up on an otherwise empty
fs. I mean..it's good to see that the house cleaning works, but you
also gotta know when to stop!

> thinking casually about ways to direct the allocator to use certain
> devices for certain things (e.g. in a hybrid system with SSDs and
> HDDs, always allocate metadata on the SSD) and there's some overlap
> there. As it stands, we can fake that in mkfs but it'll get stomped
> by balance nearly immediately.

Please share those casual thoughts with the group. :)

-h
On Tue, Sep 22, 2015 at 03:36:43PM +0200, Holger Hoffstätte wrote:
> On 09/22/15 14:59, Jeff Mahoney wrote:
> (snip)
> > So if the way we want to prevent the loss of raid type info is by
> > maintaining the last block group allocated with that raid type, fine,
> > but that's a separate discussion. Personally, I think keeping 1GB
>
> At this point I'm much more surprised to learn that the RAID type can
> apparently get "lost" in the first place, and is not persisted
> separately. I mean..wat?

It's always been like that, unfortunately.

The code tries to use the RAID type that's already present to work
out what the next allocation should be. If there aren't any chunks in
the FS, the configuration is lost, because it's not stored anywhere
else. It's one of the things that tripped me up badly when I was
failing to rewrite the chunk allocator last year.

[snip]
> > thinking casually about ways to direct the allocator to use certain
> > devices for certain things (e.g. in a hybrid system with SSDs and
> > HDDs, always allocate metadata on the SSD) and there's some overlap
> > there.
>
> Please share those casual thoughts with the group. :)

I had some as well last year (see my other mail). I hope they line up
with Jeff's. :)

Hugo.
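[Editor's note: to make Hugo's point concrete, at mount time the
allocator's notion of which profiles are in use is rebuilt solely
from the chunks found on disk. A paraphrased sketch, from memory of
the 4.3-era extent-tree.c rather than verbatim: set_avail_alloc_bits()
is real, while the loop helper here is schematic:

    	/*
    	 * Availability bits are rebuilt purely from the block groups
    	 * found on disk. With zero chunks of a given type, nothing
    	 * remembers that e.g. RAID1 was ever chosen, so the next
    	 * allocation falls back to the single profile.
    	 */
    	while ((cache = next_block_group_item(fs_info, cache)))  /* schematic */
    		set_avail_alloc_bits(fs_info, cache->flags);
]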
On Tue, Sep 22, 2015 at 01:41:31PM +0000, Hugo Mills wrote:
> On Tue, Sep 22, 2015 at 03:36:43PM +0200, Holger Hoffstätte wrote:
> > At this point I'm much more surprised to learn that the RAID type can
> > apparently get "lost" in the first place, and is not persisted
> > separately. I mean..wat?
>
> It's always been like that, unfortunately.
>
> The code tries to use the RAID type that's already present to work
> out what the next allocation should be. If there aren't any chunks in
> the FS, the configuration is lost, because it's not stored anywhere
> else. It's one of the things that tripped me up badly when I was
> failing to rewrite the chunk allocator last year.

Yeah, right now there's no persistent default for the allocator. I'm
still hoping that the object properties will magically solve that.
On Tue, Sep 22, 2015 at 04:23:33PM +0200, David Sterba wrote:
[snip]
> Yeah, right now there's no persistent default for the allocator. I'm
> still hoping that the object properties will magically solve that.

There's no obvious place that filesystem-wide properties can be
stored, though. There's a userspace tool to manipulate the few current
FS-wide properties, but that's all special-cased to use the
"historical" ioctls for those properties, with no generalisation of a
property store, or even (IIRC) any external API for them.

We're nominally using xattrs in the btrfs: namespace on directories
and files, and presumably on the top directory of a subvolume for
subvol-wide properties, but it's not clear where the FS-wide values
should go: in the top directory of subvolid=5 would be confusing,
because then you couldn't separate the properties for *that subvol*
from the ones for the whole FS (say, the default replication policy,
where you might want the top subvol to have different properties from
everything else).

Hugo.
On 2015-09-22 10:36, Hugo Mills wrote:
[snip]
> We're nominally using xattrs in the btrfs: namespace on directories
> and files, and presumably on the top directory of a subvolume for
> subvol-wide properties, but it's not clear where the FS-wide values
> should go: in the top directory of subvolid=5 would be confusing,
> because then you couldn't separate the properties for *that subvol*
> from the ones for the whole FS (say, the default replication policy,
> where you might want the top subvol to have different properties from
> everything else).

Possibly do special names for the defaults and store them there? In
general, I personally see little value in having some special
'default' properties, however. The way I would expect things to work
is that a new subvolume inherits its properties from its parent (if
it's a snapshot), or from the next higher subvolume it's nested in.
This would obviate the need for some special 'default' properties,
and would be relatively intuitive behavior for a significant majority
of people.
On Tue, Sep 22, 2015 at 10:54:45AM -0400, Austin S Hemmelgarn wrote:
[snip]
> Possibly do special names for the defaults and store them there? In
> general, I personally see little value in having some special
> 'default' properties, however.

That would work.

> The way I would expect things to work is that a new subvolume
> inherits its properties from its parent (if it's a snapshot),

Definitely this.

> or from the next higher subvolume it's nested in.

I don't think I like this. I'm not quite sure why, though, at the
moment.

It definitely makes the process at the start of allocating a new
block group much more complex: you have to walk back up through an
arbitrary depth of nested subvols to find the one that's actually got
a replication policy record in it. (Because after this feature is
brought in, there will be lots of filesystems without per-subvol
replication policies in them, and we have to have some way of dealing
with those as well.)

With an FS default policy, you only need to check the current subvol,
and then fall back to the FS default if that's not found.

These things are, I think, likely to be lightly used: I would be
reasonably surprised to find more than two or possibly three storage
policies in use on any given system with a sane sysadmin.

I'm actually not sure what the interactions of multiple storage
policies are going to be like. It's entirely possible, particularly
with some of the more exotic (but useful) suggestions I've thought of,
that the behaviour of the FS is dependent on the order in which the
block groups are allocated (i.e. "20 GiB to subvol-A, then 20 GiB to
subvol-B" results in different behaviour than "1 GiB to subvol-A,
then 1 GiB to subvol-B, and repeat"). I tried some simple Monte-Carlo
simulations, but I didn't get any concrete results out of it before
the end of the train journey. :)

> This would obviate the need for some special 'default' properties,
> and would be relatively intuitive behavior for a significant
> majority of people.

Of course, you shouldn't be nesting subvolumes anyway. It makes it
much harder to manage them.

Hugo.
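[Editor's note: a sketch of the two lookup strategies Hugo contrasts,
with hypothetical types and names, purely to illustrate the
difference in complexity:

    #include <stdbool.h>

    typedef unsigned long long u64;

    struct subvol {
    	const struct subvol *parent; /* subvolume this one is nested in */
    	bool has_policy;
    	u64 policy;                  /* some replication-policy encoding */
    };

    /* Hugo's preferred form: check the subvol, else the FS default. */
    static u64 repl_policy(const struct subvol *sv, u64 fs_default)
    {
    	return sv->has_policy ? sv->policy : fs_default;
    }

    /* The nesting-based form he objects to: an arbitrary-depth walk
     * up through enclosing subvolumes until one carries a policy. */
    static u64 repl_policy_nested(const struct subvol *sv, u64 fs_default)
    {
    	while (sv && !sv->has_policy)
    		sv = sv->parent;
    	return sv ? sv->policy : fs_default;
    }
]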
On 9/22/15 9:36 AM, Holger Hoffstätte wrote:
(snip)
> Can you explain why keeping at least one data chunk (or the
> appropriate number across devices, depending on RAID level..) is "a
> bit much"? IMHO this would have no real negative impact on end
> users (who think in terms of overall filesystem space, not how much
> of that has been lazily touched), nor for more obscure use cases
> like sparse images - which would still be sparsely reserved.
[snip]

Ok, say you have a file system populated entirely with ~3k files:
small enough to be stored inline in metadata, so you never actually
need a data chunk at all, and a placeholder data chunk is pure waste.
It's a contrived case, sure, but not an unrealistic one. We've
already seen workloads like this.

> > thinking casually about ways to direct the allocator to use
> > certain devices for certain things (e.g. in a hybrid system with
> > SSDs and HDDs, always allocate metadata on the SSD) and there's
> > some overlap there. As it stands, we can fake that in mkfs but
> > it'll get stomped by balance nearly immediately.
>
> Please share those casual thoughts with the group. :)

The simplest way would be to add a flag to disallow removal of block
groups. If the goal is to say "this device is only to be used for
metadata at X RAID level" then it's enough to say that chunks on that
device can't be balanced away.

Beyond that, we start to run into the same problems we have with RAID
levels now. The disk format technically supports multiple RAID levels
simultaneously for both data and metadata. How that would work in
practice gets a little tricky. Maybe, as David said, something like
object properties are the answer. If a subvolume is annotated as
needing storage of a particular description, then we can create
chunks meeting that requirement as required.

-Jeff

--
Jeff Mahoney
SUSE Labs
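[Editor's note: a minimal sketch of Jeff's "flag to disallow removal"
idea, as one might wire it into btrfs_delete_unused_bgs() alongside
the checks quoted earlier in the thread. The field name is invented
here; it is not from the thread or the kernel:

    	/*
    	 * Hypothetical in-memory marker on the block group that the
    	 * cleaner would honour before deleting an empty group.
    	 */
    	if (ret || btrfs_mixed_space_info(space_info) ||
    	    block_group->pinned_by_policy) {   /* invented field */
    		btrfs_put_block_group(block_group);
    		continue;
    	}
]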
On 2015-09-22 11:39, Hugo Mills wrote:
[snip]
> It definitely makes the process at the start of allocating a new
> block group much more complex: you have to walk back up through an
> arbitrary depth of nested subvols to find the one that's actually got
> a replication policy record in it. (Because after this feature is
> brought in, there will be lots of filesystems without per-subvol
> replication policies in them, and we have to have some way of dealing
> with those as well.)

ro-compat flag perhaps?

[snip]
> I'm actually not sure what the interactions of multiple storage
> policies are going to be like. It's entirely possible, particularly
> with some of the more exotic (but useful) suggestions I've thought of,
> that the behaviour of the FS is dependent on the order in which the
> block groups are allocated.
[snip]

Yeah, I could easily see that getting complicated when you add in the
(hopefully soon) possibility of n-copy replication.

> Of course, you shouldn't be nesting subvolumes anyway. It makes
> it much harder to manage them.

That depends, though; I only ever do single nesting (i.e. a subvolume
in a subvolume), and I use it to exclude stuff from getting saved in
snapshots (mostly stuff like clones of public git trees, or other
stuff that's easy to reproduce without a backup). Beyond that,
though, there are other inherent issues, of course.
On 2015-09-22 13:32, Austin S Hemmelgarn wrote:
[snip]
> Yeah, I could easily see that getting complicated when you add in the
> (hopefully soon) possibility of n-copy replication.

On that note, it might be nice to have the ability to say 'store at
least n copies of this data' in addition to being able to say 'store
exactly this many copies of this data' (that could be really helpful
for filesystems with differing device sizes).
Hi, Filipe David Manana

> At the very least this patch should have its title and description
> changed - to reflect that it attempts to solve only (and partially)
> the problem of losing the raid profile type.

When I found the bug while testing v4.3-rc1, the only error message
was "no space", and there was no way to trigger the "losing raid
profile type" problem in v4.3-rc1; that is why this patch was titled
"fix no space bug". I will update the title now: since your patch is
already in integration-4.3, at the point where this patch applies the
visible bug becomes "losing raid profile type".

> Because the enospc issue is not caused by not having any data block
> group allocated, but is instead due to a recent regression when
> initializing a space_info object as full, as described and proved
> in the patch mentioned before [1].

space_info->full is the direct reason, but not the root reason.
Removing all block groups makes the filesystem lose all of its raid
type information, and in that case we have no way to recover the raid
type at mount time. Neither "return no space on write" nor "create a
single-profile chunk" helps the user restore the filesystem to the
right state.

> - you could run into the enospc issue even without having the
> automatic removal of empty block groups or the patch [2] introduced
> in 4.3 (just try the new xfstest [3] to verify it yourself).

Yes, a balance operation can also delete all block groups in some
cases. That is a problem in the balance code.

Thanks
Zhaolei

> [1] https://patchwork.kernel.org/patch/7133451/
> [2] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=e44163e177960ee60e32a73bffdd53c3a5827406
> [3] https://git.kernel.org/cgit/fs/xfs/xfstests-dev.git/commit/?id=da82741228c7efaedae2e027efc54b1892800fe1
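[Editor's note: for orientation, the shape of the failure Zhao Lei
and Filipe are debating, as a heavily condensed sketch of the
data-write path in btrfs_check_data_free_space() (the function Filipe
named earlier). This is paraphrased from memory of the 4.3-era code,
not verbatim:

    	/* btrfs_check_data_free_space(), heavily condensed */
    	if (used + bytes > data_sinfo->total_bytes) {
    		if (data_sinfo->full)
    			return -ENOSPC;	/* 4.3-rc1 regression: a freshly
    					 * created space_info began life
    					 * with full == 1, so this fired
    					 * before any chunk allocation
    					 * was ever attempted */
    		ret = do_chunk_alloc(trans, root, alloc_target,
    				     CHUNK_ALLOC_NO_FORCE);
    	}
]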
Hi, Jeff Mahoney

> From: Jeff Mahoney [mailto:jeffm@suse.com]
> Sent: Tuesday, September 22, 2015 9:00 PM
> Subject: Re: [PATCH] btrfs: Fix no space bug caused by removing bg
>
[snip]
> The only reason not to do this is the loss of the raid type, which
> is an issue that needs to be addressed separately. As Filipe said in
> his responses, this isn't the source of the problem - it just makes
> it obvious. We've seen in actual bug reports that it's possible to
> create exactly the same scenario by balancing an empty file system.

I replied to this in Filipe's mail; in short, I'll change the patch
title and description.

> So if the way we want to prevent the loss of raid type info is by
> maintaining the last block group allocated with that raid type,
> fine, but that's a separate discussion. Personally, I think keeping
> 1GB allocated as a placeholder is a bit much. Beyond that, I've been
> thinking casually about ways to direct the allocator to use certain
> devices for certain things (e.g. in a hybrid system with SSDs and
> HDDs, always allocate metadata on the SSD) and there's some overlap
> there.

Good idea! It would improve performance.

Thanks
Zhaolei
Austin S Hemmelgarn posted on Tue, 22 Sep 2015 13:32:58 -0400 as
excerpted:

>> Of course, you shouldn't be nesting subvolumes anyway. It makes
>> it much harder to manage them.
>
> That depends, though; I only ever do single nesting (i.e. a subvolume
> in a subvolume), and I use it to exclude stuff from getting saved in
> snapshots

Good point. Such snapshot-exclusion-via-subvolume-boundary is a
frequently recommended use-case for nocow files, of course, to keep
them from fragmenting due to snapshot-provoked COW.

If there's no (other) reason to mount the subvolumes separately,
either because they're primarily snapshot-exclusion subvolumes or
because for whatever other reason they aren't intended to be
specifically mounted, but instead are simply used as ordinary subdirs
for most purposes, then nesting really /does/ make sense, because
relying on the normal subdir behavior avoids the otherwise necessary
separate mount.

Which, meanwhile, brings up a particular bone I have to pick with
systemd's use of btrfs subvolumes, particularly in tmpfiles.d usage.
Since IIRC systemd 219, various previously created directories,
including the (virtual-)machines tree, are by default subvolume
creations, of course only if they don't already exist. These
subvolumes are explicitly nested at whatever location they happen to
appear in the subtree, as tmpfiles.d simply creates them as
subvolumes, with a fallback to the standard subdirs they were created
as in earlier versions on filesystems other than btrfs. Other than
the standard mount processing, which involves admins specifically
setting up the appropriate subvolumes and mounts ahead of time,
there's no provision for creating them in a standard subvolume
location at root level and mounting them at their final destination
in the tree, as is standard recommended practice for btrfs upstream.

Personally, I just prefer that they stay as subdirs, since I have no
particular need for subvolumes and don't want to complicate my btrfs
use-case with them. I already break my tree up into multiple
independent btrfs filesystems using standard partitioning,
specifically to avoid having so many data eggs in one filesystem
basket, and thus to avoid the risk of losing them all should the
bottom of that single basket fall out with the filesystem going bad
on me.

I know from hard-learned experience with much-too-big mdraid arrays
how painful maintenance operations can be when they take hours or
days, compared to the seconds (due to ssd) or minutes (were they on
spinning rust) that maintenance such as balance or scrub now takes me
on each of my all-well-under-hundred-gig separate btrfs. The
difference is even more pronounced since most of the time there are
only one or two filesystems I need to do maintenance on, instead of
the whole tree. So I use multiple small independent btrfs, and don't
want or need subvolumes, which would only complicate my use-case.

Fortunately, I can avoid systemd creating those subvolumes even tho
I'm on btrfs, the normal target for them, by simply already having
the subdirs created... as subdirs. Tho the initial 219 implementation
bugged out on them (a subvolume-create error, because my root
filesystem is deliberately left read-only by default, propagated
through to a tmpfiles.d service failure), I filed a bug, and by 221
(IIRC) tmpfiles.d was working correctly, leaving my existing subdirs
alone.

Unfortunately, I now have to remember to create those subdirs
manually to avoid the automated subvolume creation on a new install,
or, should I forget, to discover and undo the automated subvolume
creation afterward. Oh, well...
On Tue, Sep 22, 2015 at 02:36:02PM +0000, Hugo Mills wrote:
> > Yeah, right now there's no persistent default for the allocator. I'm
> > still hoping that the object properties will magically solve that.
>
> There's no obvious place that filesystem-wide properties can be
> stored, though. There's a userspace tool to manipulate the few current
> FS-wide properties, but that's all special-cased to use the
> "historical" ioctls for those properties, with no generalisation of a
> property store, or even (IIRC) any external API for them.

From the UI point of view, we proposed to add a specifier that would
route the property to either the subvolume or the filesystem:

$ btrfs prop set -t filesystem bgtype raid0
$ btrfs prop set -t subvolume bgtype raid1

How this will get stored in the xattrs is another question. As there's
always only a single instance of the filesystem properties, it could
be something like 'btrfs.fs.bgtype', stored as an xattr of the
toplevel subvolume.
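[Editor's note: to make the storage question concrete, here is what
David's suggested scheme might reduce to at the syscall level. This
is a sketch only: the xattr name follows his 'btrfs.fs.bgtype'
suggestion, the mount point is arbitrary, and nothing here is an
agreed-upon interface; an unpatched kernel would simply reject the
unknown attribute name:

    #include <stdio.h>
    #include <sys/xattr.h>

    int main(void)
    {
    	/* hypothetical: persist the fs-wide default on the toplevel
    	 * subvolume's directory, mounted here at /mnt */
    	if (setxattr("/mnt", "btrfs.fs.bgtype", "raid1", 5, 0) != 0)
    		perror("setxattr");

    	char buf[16] = { 0 };
    	if (getxattr("/mnt", "btrfs.fs.bgtype", buf, sizeof(buf) - 1) < 0)
    		perror("getxattr");
    	else
    		printf("bgtype = %s\n", buf);
    	return 0;
    }
]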
On 2015-09-23 21:12, David Sterba wrote:
[snip]
> From the UI point of view, we proposed to add a specifier that would
> route the property to either the subvolume or the filesystem:
>
> $ btrfs prop set -t filesystem bgtype raid0
> $ btrfs prop set -t subvolume bgtype raid1

BTW, is btrfs going to support different chunk/bg types per
subvolume?! I thought data/meta/system chunk types were all at the
filesystem level, and I was planning to use the superblock to record
them...

If we really support that, does it mean we will have different
meta/data types for each subvolume? That's a little too flexible for
me....

Thanks,
Qu
On Wed, Sep 23, 2015 at 03:12:26PM +0200, David Sterba wrote:
[snip]
> How this will get stored in the xattrs is another question. As there's
> always only a single instance of the filesystem properties, it could
> be something like 'btrfs.fs.bgtype', stored as an xattr of the
> toplevel subvolume.

That's what Austin suggested as well. Makes sense to me.

Hugo.
On Tue, Sep 22, 2015 at 03:39:30PM +0000, Hugo Mills wrote:
> > The way I would expect things to work is that a new subvolume
> > inherits its properties from its parent (if it's a snapshot),
>
> Definitely this.
>
> > or from the next higher subvolume it's nested in.
>
> I don't think I like this. I'm not quite sure why, though, at the
> moment.

I don't like inheritance from anything other than the parent
subvolume, because it makes things less obvious.

> It definitely makes the process at the start of allocating a new
> block group much more complex: you have to walk back up through an
> arbitrary depth of nested subvols to find the one that's actually got
> a replication policy record in it.
[snip]
> With an FS default policy, you only need to check the current subvol,
> and then fall back to the FS default if that's not found.

That looks reasonable to me.

> These things are, I think, likely to be lightly used: I would be
> reasonably surprised to find more than two or possibly three storage
> policies in use on any given system with a sane sysadmin.

Agreed. At the moment I'm thinking about all the configuration
possibilities we want to give to the users. E.g. the inheritance
could be configurable at the property level.

The usecase: the toplevel has compression enabled, but I don't want
any new subvolume to share this property automatically.

(The blockgroup type is probably a bad example for configurable
inheritance, as it would not work for shared extents if the type is
different.)

> Of course, you shouldn't be nesting subvolumes anyway. It makes
> it much harder to manage them.

Depends; there are valid usecases for nested subvolumes, but yeah,
the management is harder.
On 2015-09-23 09:19, Qu Wenruo wrote:
[snip]
> BTW, is btrfs going to support different chunk/bg types per
> subvolume?! I thought data/meta/system chunk types were all at the
> filesystem level, and I was planning to use the superblock to record
> them...
>
> If we really support that, does it mean we will have different
> meta/data types for each subvolume? That's a little too flexible for
> me....

This has actually been a planned feature for a while, and really is
needed to compete with the flexibility that ZFS gives for this kind
of thing. System chunk types should still be set separately
(although, once we have n-way replication, they really should be set,
separately from metadata, to at least one copy per device on
multi-device filesystems).
On Wed, Sep 23, 2015 at 09:19:38PM +0800, Qu Wenruo wrote:
> BTW, is btrfs going to support different chunk/bg types per
> subvolume?!

People have been asking for that for years.

> I thought data/meta/system chunk types were all at the filesystem
> level, and I was planning to use the superblock to record them...

Any superblock change like that will lead to backward
incompatibilities worse than just storing some key:value as an
extended attribute. Older kernels will just ignore the constraints,
but the filesystem will still be mountable/usable.

> If we really support that, does it mean we will have different
> meta/data types for each subvolume? That's a little too flexible for
> me....

btrfs' design is ready for this kind of flexibility. The hard task is
to stay sane and offer something flexible yet manageable.
On Wed, Sep 23, 2015 at 09:19:38PM +0800, Qu Wenruo wrote:
[snip]
> BTW, is btrfs going to support different chunk/bg types per
> subvolume?! I thought data/meta/system chunk types were all at the
> filesystem level, and I was planning to use the superblock to record
> them...

Well, it's been talked about for a *long* time, and it's certainly
something for which there are use-cases, and non-zero demand from
users.

> If we really support that, does it mean we will have different
> meta/data types for each subvolume? That's a little too flexible for
> me....

I think you'd have to have a single metadata type for the whole FS --
it doesn't make much sense to try splitting that up "by subvolume",
as almost all of the data structures are common to all subvolumes.
For data, however, it should certainly be possible to specify that a
subvol's data be stored with a given replication policy.

While David and Austin's suggestion of a special xattr on the top
level subvol for the FS default is one possible implementation, using
the superblock would work just as well -- with the minor detail that
you'd have to special-case it into the properties UI, whereas with
xattrs there will be less special-casing involved.

Hugo.
On 2015-09-23 09:28, David Sterba wrote:
> On Tue, Sep 22, 2015 at 03:39:30PM +0000, Hugo Mills wrote:
>>> The way I would expect things to work is that a new subvolume
>>> inherits its properties from its parent (if it's a snapshot),
>>
>> Definitely this.
>>
>>> or from the next higher subvolume it's nested in.
>>
>> I don't think I like this. I'm not quite sure why, though, at the
>> moment.
>
> I don't like inheritance from anything other than the parent subvolume
> because it makes things less obvious.

Possibly, but that depends on how you view things. Internally,
subvolumes are independent of each other, but to a regular user (or
anything that just uses the VFS layer) they look hierarchical, and as
such, without knowing the internals of the FS (which no regular user
should need to know), I would expect the profile to propagate down from
the (apparently) next higher subvolume.

It's worth noting that I mean it should just copy the properties from
the next higher subvolume at creation time, and if the next higher
subvolume doesn't have the properties set, just use the filesystem
defaults (and not try to walk back up more than one level); see the
sketch after this message.

>> It definitely makes the process at the start of allocating a new
>> block group much more complex: you have to walk back up through an
>> arbitrary depth of nested subvols to find the one that's actually got
>> a replication policy record in it. (Because after this feature is
>> brought in, there will be lots of filesystems without per-subvol
>> replication policies in them, and we have to have some way of dealing
>> with those as well.)
>>
>> With an FS default policy, you only need check the current subvol,
>> and then fall back to the FS default if that's not found.
>
> That looks reasonable to me.
>
>> These things are, I think, likely to be lightly used: I would be
>> reasonably surprised to find more than two or possibly three storage
>> policies in use on any given system with a sane sysadmin.
>
> Agreed. At the moment I'm thinking about all the configuration
> possibilities we want to give to the users. E.g. the inheritance can
> be configurable on the property level.
>
> The use case: the toplevel has compression enabled but I don't want
> any new subvolume to share this property automatically.
>
> (The blockgroup type is probably a bad example for configurable
> inheritance, as it would not work for shared extents if the type is
> different.)

The ideal situation in my opinion WRT block-group profile inheritance
would be that when an extent becomes shared, it gets re-striped at the
highest RAID profile (prioritizing 3-copy or higher replication over
raid56) of any of the files that share the extent. This would result in
some more I/O than the usual clone operation in some cases, but any
other method has the potential to silently reduce the degree of data
safety for a given file.
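[Editor's note: for concreteness, here is a minimal userspace sketch of
the one-level lookup described above: copy the property from the
containing subvolume at creation time, and at allocation time check only
the current subvolume before falling back to the FS default. All names
(struct subvol, inherit_bgtype(), effective_bgtype()) are illustrative
assumptions, not btrfs code.]

#include <stdio.h>

struct subvol {
	const char *name;
	const char *bgtype;        /* NULL when the property is unset */
	struct subvol *parent_dir; /* the subvolume it is nested in */
};

/* At creation time, copy the property from the containing subvolume;
 * never walk more than one level up. */
static const char *inherit_bgtype(const struct subvol *container,
				  const char *fs_default)
{
	if (container && container->bgtype)
		return container->bgtype;
	return fs_default;
}

/* At allocation time, check only the current subvolume, then fall
 * back to the filesystem-wide default. */
static const char *effective_bgtype(const struct subvol *sv,
				    const char *fs_default)
{
	return sv->bgtype ? sv->bgtype : fs_default;
}

int main(void)
{
	struct subvol top = { "top", "raid1", NULL };
	struct subvol nested = { "nested", NULL, &top };

	nested.bgtype = inherit_bgtype(&top, "single"); /* copies "raid1" */
	printf("%s allocates as %s\n", nested.name,
	       effective_bgtype(&nested, "single"));
	return 0;
}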
On 2015-09-23 21:32, Austin S Hemmelgarn wrote:
> On 2015-09-23 09:19, Qu Wenruo wrote:
>> On 2015-09-23 21:12, David Sterba wrote:
>>> On Tue, Sep 22, 2015 at 02:36:02PM +0000, Hugo Mills wrote:
>>>>> Yeah, right now there's no persistent default for the allocator. I'm
>>>>> still hoping that the object properties will magically solve that.
>>>>
>>>> There's no obvious place that filesystem-wide properties can be
>>>> stored, though. There's a userspace tool to manipulate the few current
>>>> FS-wide properties, but that's all special-cased to use the
>>>> "historical" ioctls for those properties, with no generalisation of a
>>>> property store, or even (IIRC) any external API for them.
>>>
>>> From the UI point of view, we proposed adding a specifier that would
>>> route the property to either the subvolume or the filesystem:
>>>
>>> $ btrfs prop set -t filesystem bgtype raid0
>>> $ btrfs prop set -t subvolume bgtype raid1
>>
>> BTW, is btrfs going to support different chunk/bg types per subvolume?!
>> I thought data/meta/system chunk types were all per-filesystem, and
>> was planning to use the superblock to record them...
>>
>> If we really support that, does it mean we will have different
>> meta/data types for each subvolume?
>> That's a little too flexible for me....
>
> This has actually been a planned feature for a while, and it is really
> needed to compete with the flexibility that ZFS gives for this kind of
> thing. System chunk types should still be set separately (although,
> once we have n-way replication, they really should be set separately
> from metadata to at least one copy per device on multi-device
> filesystems).

Thanks, David and Austin.

As it's already planned, I think it will need a new incompat flag
anyway, or an old kernel could easily remove or convert the desired
profile.

So what about adding new ROOT_ITEM members to record the data/metadata
profiles? (A rough sketch of what I mean follows this message.)

I'm not a big fan of using xattrs, as they're quite easy to modify from
user space, and I'm not completely confident about the possible
concurrency between xattr modification and chunk allocation.

And at the logical level, an xattr is at the same level as an inode,
but a subvolume is definitely at a higher level, even though it's
normally treated as a directory. So I'd rather record it in the
root_item, not as a subvolume xattr, even though that may need an extra
ioctl to handle it.

Anyway, I'm just a newcomer to this feature, so it's completely OK to
ignore my stupid ideas.

BTW, what about the original patch? I hope it can be merged in the next
RC, as it affects quite a lot of test cases...

Thanks,
Qu
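[Editor's note: a purely hypothetical illustration of Qu's suggestion;
these members do NOT exist in the real struct btrfs_root_item, the
snippet only shows the shape of the idea.]

#include <stdint.h>
#include <stdio.h>

/* Hypothetical extension of the root item, for illustration only. */
struct root_item_sketch {
	/* ... the existing root item fields would come first ... */
	uint64_t data_profile;     /* e.g. a RAID1 block group flag */
	uint64_t metadata_profile;
	uint64_t reserved[6];      /* room for future per-subvolume policy */
};

int main(void)
{
	printf("extra bytes needed: %zu\n", sizeof(struct root_item_sketch));
	return 0;
}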
On Wed, Sep 23, 2015 at 03:28:29PM +0200, David Sterba wrote:
> On Tue, Sep 22, 2015 at 03:39:30PM +0000, Hugo Mills wrote:
> > > The way I would expect things to work is that a new subvolume
> > > inherits its properties from its parent (if it's a snapshot),
> >
> > Definitely this.
> >
> > > or from the next higher subvolume it's nested in.
> >
> > I don't think I like this. I'm not quite sure why, though, at the
> > moment.
>
> I don't like inheritance from anything other than the parent subvolume
> because it makes things less obvious.
>
> > It definitely makes the process at the start of allocating a new
> > block group much more complex: you have to walk back up through an
> > arbitrary depth of nested subvols to find the one that's actually got
> > a replication policy record in it. (Because after this feature is
> > brought in, there will be lots of filesystems without per-subvol
> > replication policies in them, and we have to have some way of dealing
> > with those as well.)
> >
> > With an FS default policy, you only need check the current subvol,
> > and then fall back to the FS default if that's not found.
>
> That looks reasonable to me.
>
> > These things are, I think, likely to be lightly used: I would be
> > reasonably surprised to find more than two or possibly three storage
> > policies in use on any given system with a sane sysadmin.
>
> Agreed. At the moment I'm thinking about all the configuration
> possibilities we want to give to the users. E.g. the inheritance can
> be configurable on the property level.
>
> The use case: the toplevel has compression enabled but I don't want
> any new subvolume to share this property automatically.
>
> (The blockgroup type is probably a bad example for configurable
> inheritance, as it would not work for shared extents if the type is
> different.)

There are some major questions about the semantics and the UI for
per-subvolume RAID, too. (Assume that A is RAID-1.)

  # btrfs sub snap A A'
  # btrfs prop set A' raid raid-5

What happens? Do we convert A' to RAID-5 immediately, doubling (or
more) the disk usage? Do we do lazy conversion, so that only new data
in A' is written as RAID-5? Do we need a "btrfs sub restripe A'"
command? Is directly setting properties actually the right interface to
use here? (Reading properties... yes, but we still have the possible
issue of getting information on partially-converted subvols.)

  # btrfs sub snap A A'      # A' is also RAID-1 here: inherited from A
  # btrfs prop set A' raid raid-5
  # btrfs sub restripe A'
  (you wait, time passes)
  # btrfs sub snap A A''
  # btrfs prop set A'' raid raid-5
  # btrfs sub restripe A''

How do I make A' and A'' pick up their sharing again? Dedup? Will this
cause problems with incremental send?

  # btrfs balance start -dconvert=single A

What does this do? Convert the whole FS's data to single? Just subvol
A? If the whole FS, will all the subvols' raid xattrs be dropped? Or
will only the FS-default ones be changed? If only the FS-default xattrs
are changed, then a write to any non-default subvol will revert to its
configured RAID level, and we almost immediately lose the "everything
single" configuration requested by the convert...

  # btrfs balance start -mconvert=single A

What does this do? (Probably convert the whole FS's metadata to single
-- which may well answer the previous question as well, unless we want
to surprise users.)

I'll probably think of some more questions later, but that's enough to
be going on with. :)

Hugo.
On Wed, Sep 23, 2015 at 10:00:12PM +0800, Qu Wenruo wrote:
> As it's already planned, I think it will need a new incompat flag
> anyway, or an old kernel could easily remove or convert the desired
> profile.

The use case with the old kernel is closer to the rescue scenario than
to regular use. We do have support for read-only compatibility (grep
for COMPAT_RO), so all of this might be covered by such an incompat
flag.

> So what about adding new ROOT_ITEM members to record the data/metadata
> profiles?

The profile is not the only property we'd like to add. It is possible
to add it to btrfs_root_item in a safe way, like the new send-related
items.

> I'm not a big fan of using xattrs, as they're quite easy to modify from
> user space, and I'm not completely confident about the possible
> concurrency between xattr modification and chunk allocation.

The point is to be able to modify it from userspace, but it's done in a
safe way. The xattrs are in their own namespace and have handlers that
propagate the changes anywhere they're needed. The idea is to cache the
values alongside the in-memory structures and only update them via the
handlers; see the sketch after this message.

> And at the logical level, an xattr is at the same level as an inode,
> but a subvolume is definitely at a higher level, even though it's
> normally treated as a directory.
> So I'd rather record it in the root_item, not as a subvolume xattr,
> even though that may need an extra ioctl to handle it.

The xattrs give us extensibility that does not require changes to the
size of the root_item structure. Basically, unknown and unsupported
xattrs (properties) are invisible in the b-tree.

> Anyway, I'm just a newcomer to this feature, so it's completely OK to
> ignore my stupid ideas.

No worries. We had similar discussions in the past; I proposed special
ioctls first, but then the xattrs were found more suitable. So we're
letting you know the status, but there's still some chance that we
missed something and you can point it out.
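[Editor's note: to make the handler-and-cache idea concrete, here is a
minimal userspace sketch. The names (subvol_cache, prop_bgtype_set())
are invented for illustration; none of this is the real btrfs property
API.]

#include <stdio.h>
#include <string.h>

/* In-memory copy of a subvolume's cached properties. */
struct subvol_cache {
	char bgtype[16];
};

/* A property handler validates the new value, persists it (only
 * simulated here), and refreshes the in-memory cache so allocation
 * paths never have to re-read the xattr. */
static int prop_bgtype_set(struct subvol_cache *cache, const char *value)
{
	if (strcmp(value, "single") && strcmp(value, "raid0") &&
	    strcmp(value, "raid1"))
		return -1; /* reject unknown profiles */

	/* ...a real handler would write the xattr to disk here... */
	snprintf(cache->bgtype, sizeof(cache->bgtype), "%s", value);
	return 0;
}

int main(void)
{
	struct subvol_cache cache = { "single" };

	if (prop_bgtype_set(&cache, "raid1") == 0)
		printf("cached bgtype: %s\n", cache.bgtype);
	return 0;
}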
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5411f0a..35cf7eb 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10012,7 +10012,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 				       bg_list);
 		space_info = block_group->space_info;
 		list_del_init(&block_group->bg_list);
-		if (ret || btrfs_mixed_space_info(space_info)) {
+		if (ret || btrfs_mixed_space_info(space_info) ||
+		    block_group->list.next == block_group->list.prev) {
 			btrfs_put_block_group(block_group);
 			continue;
 		}
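[Editor's note: why does the added check detect the last block group of
a RAID type? On a circular, doubly linked list_head list, an entry's
next and prev pointers are equal only when that entry is the sole
member (both then point at the list head); the kernel's
list_is_singular() helper tests the same condition from the head's
side. A standalone sketch, re-implementing list_add() in userspace
rather than using kernel code:]

#include <stdio.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Same insertion-at-head behaviour as the kernel's list_add(). */
static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

int main(void)
{
	struct list_head head = { &head, &head };
	struct list_head a, b;

	list_add(&a, &head);
	printf("one entry:   next == prev? %d\n", a.next == a.prev); /* 1 */

	list_add(&b, &head);
	printf("two entries: next == prev? %d\n", a.next == a.prev); /* 0 */
	return 0;
}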
btrfs in v4.3-rc1 failed many xfstests cases with the '-o nospace_cache'
mount option.

Failed cases are:
  btrfs/008,016,019,020,026,027,028,029,031,041,046,048,050,051,053,054,
  077,083,084,087,092,094
  generic/004,010,014,023,024,074,075,080,086,087,089,091,092,100,112,123,
  124,125,126,127,131,133,192,193,198,207,208,209,213,214,215,228,239,240,
  246,247,248,255,263,285,306,313,316,323

We can reproduce this bug with the following simple commands:
  TEST_DEV=/dev/vdh
  TEST_DIR=/mnt/tmp

  umount "$TEST_DEV" >/dev/null
  mkfs.btrfs -f "$TEST_DEV"
  mount "$TEST_DEV" "$TEST_DIR"

  umount "$TEST_DEV"
  mount "$TEST_DEV" "$TEST_DIR"

  cp /bin/bash $TEST_DIR

The result is:
  (previous commands omitted)
  # cp /bin/bash $TEST_DIR
  cp: writing `/mnt/tmp/bash': No space left on device

By bisecting, we can see it is triggered by commit e44163e17796
("btrfs: explictly delete unused block groups in close_ctree and
ro-remount").

But the broken code is not in that patch: btrfs deletes all chunks if
there is no data in the filesystem, and the above patch just makes it
visible.

Detailed reason:
1: mkfs a blank filesystem, or delete everything in the filesystem
2: umount the fs
   (current code will delete all data chunks)
3: mount the fs
   Because there are no data chunks, the data space_info has no chance
   to be initialized, which means: space_info->total_bytes == 0 and
   space_info->full == 1.
4: do some writes
   Current code skips chunk allocation because space_info->full is set,
   and returns -ENOSPC.

Fix:
Don't auto-delete the last block group of a RAID type.
If we delete all block groups of a RAID type, it not only causes the
above bug, but may also change the filesystem to all-single in some
cases.

Test:
Tested with the above script, and confirmed the logic via debug output.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)