Message ID | 1515401010-26802-1-git-send-email-nborisov@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 2018年01月08日 16:43, Nikolay Borisov wrote: > This test has been failing for btrfs for quite some time, > at least since 4.7. There are 2 implementation details of btrfs that > it exposes: > > 1. Currently btrfs filesystem under 100mb are created in Mixed block > group mode. Freespace accounting for it is not 100% accurate - I've > observed around 100-200kb discrepancy between a newly created filesystem, > then writing a file and deleting it and checking the free space. This > falls within %3 and not %1 as hardcoded in the test. > > 2. BTRFS won't flush it's delayed allocation on file deletion if less > than 32mb are deleted. On such files we need to perform sync (missing > in the test) or wait until time elapses for transaction commit. I'm a little confused about the 32mb limit. My personal guess about the reason to delay space freeing would be: 1) Performance Btrfs tree operation (at least for write) is slow due to its tree design. So it may makes sense to delay space freeing. But in that case, 32MB may seems to small to really improve the performance. (Max file extent size is 128M, delaying one item deletion doesn't really improve performance) 2) To avoid later new allocation to rewrite the data. It's possible that freed space of deleted inode A get allocated to new file extents. And a power loss happens before we commit the transaction. In that case, if everything else works fine, we should be reverted to previous transaction where deleted inode A still exists. But we lost its data, as its data is overwritten by other file extents. And any read will just cause csum error. But in that case, there shouldn't be any 32MB limit, but all deletion of orphan inodes should be delayed. And further more, this can be addressed using log tree, to log such deletion so at recovery time, we just delete that inode. So I'm wonder if we can improve btrfs deletion behavior. > > Since mixed mode is somewhat deprecated and btrfs is not really intended > to be used on really small devices let's just adjust the test to > create a 101mb fs, which doesn't use mixed mode and really test > freespace accounting. Despite of some btrfs related questions, I'm wondering if there is any standard specifying (POSIX?) how a filesystem should behave when unlinking a file. Should the space freeing be synchronized? And how should statfs report available space? In short, I'm wondering if this test and its expected behavior is generic enough for all filesystems. Thanks, Qu > > Signed-off-by: Nikolay Borisov <nborisov@suse.com> > --- > tests/generic/015 | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/tests/generic/015 b/tests/generic/015 > index 78f2b13..416c4ae 100755 > --- a/tests/generic/015 > +++ b/tests/generic/015 > @@ -53,7 +53,7 @@ _supported_os Linux > _require_scratch > _require_no_large_scratch_dev > > -_scratch_mkfs_sized `expr 50 \* 1024 \* 1024` >/dev/null 2>&1 \ > +_scratch_mkfs_sized `expr 101 \* 1024 \* 1024` >/dev/null 2>&1 \ > || _fail "mkfs failed" > _scratch_mount || _fail "mount failed" > out=$SCRATCH_MNT/fillup.$$ >
2018-01-08 15:54 GMT+03:00 Qu Wenruo <quwenruo.btrfs@gmx.com>: > > > On 2018年01月08日 16:43, Nikolay Borisov wrote: >> This test has been failing for btrfs for quite some time, >> at least since 4.7. There are 2 implementation details of btrfs that >> it exposes: >> >> 1. Currently btrfs filesystem under 100mb are created in Mixed block >> group mode. Freespace accounting for it is not 100% accurate - I've >> observed around 100-200kb discrepancy between a newly created filesystem, >> then writing a file and deleting it and checking the free space. This >> falls within %3 and not %1 as hardcoded in the test. >> >> 2. BTRFS won't flush it's delayed allocation on file deletion if less >> than 32mb are deleted. On such files we need to perform sync (missing >> in the test) or wait until time elapses for transaction commit. > > I'm a little confused about the 32mb limit. > > My personal guess about the reason to delay space freeing would be: > 1) Performance > Btrfs tree operation (at least for write) is slow due to its tree > design. > So it may makes sense to delay space freeing. > > But in that case, 32MB may seems to small to really improve the > performance. (Max file extent size is 128M, delaying one item > deletion doesn't really improve performance) > > 2) To avoid later new allocation to rewrite the data. > It's possible that freed space of deleted inode A get allocated to > new file extents. And a power loss happens before we commit the > transaction. > > In that case, if everything else works fine, we should be reverted to > previous transaction where deleted inode A still exists. > But we lost its data, as its data is overwritten by other file > extents. And any read will just cause csum error. > > But in that case, there shouldn't be any 32MB limit, but all deletion > of orphan inodes should be delayed. > > And further more, this can be addressed using log tree, to log such > deletion so at recovery time, we just delete that inode. > > So I'm wonder if we can improve btrfs deletion behavior. > > >> >> Since mixed mode is somewhat deprecated and btrfs is not really intended >> to be used on really small devices let's just adjust the test to >> create a 101mb fs, which doesn't use mixed mode and really test >> freespace accounting. > > Despite of some btrfs related questions, I'm wondering if there is any > standard specifying (POSIX?) how a filesystem should behave when > unlinking a file. > > Should the space freeing be synchronized? And how should statfs report > available space? > > In short, I'm wondering if this test and its expected behavior is > generic enough for all filesystems. > > Thanks, > Qu > >> >> Signed-off-by: Nikolay Borisov <nborisov@suse.com> >> --- >> tests/generic/015 | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/tests/generic/015 b/tests/generic/015 >> index 78f2b13..416c4ae 100755 >> --- a/tests/generic/015 >> +++ b/tests/generic/015 >> @@ -53,7 +53,7 @@ _supported_os Linux >> _require_scratch >> _require_no_large_scratch_dev >> >> -_scratch_mkfs_sized `expr 50 \* 1024 \* 1024` >/dev/null 2>&1 \ >> +_scratch_mkfs_sized `expr 101 \* 1024 \* 1024` >/dev/null 2>&1 \ >> || _fail "mkfs failed" >> _scratch_mount || _fail "mount failed" >> out=$SCRATCH_MNT/fillup.$$ >> > All fs, including btrfs (AFAIK) return unlink(), (if file not open) only then space has been freed. So free space after return of unlink() must be freed. Proofs: [1] [2] [3] [4] - Posix, looks like do not describe behaviour. 1. http://man7.org/linux/man-pages/man2/unlink.2.html 2. https://stackoverflow.com/questions/31448693/why-system-call-unlink-so-slow 3. https://www.spinics.net/lists/linux-btrfs/msg59901.html 4. https://www.unix.com/man-page/posix/1P/unlink/ Thanks.
On 8.01.2018 14:54, Qu Wenruo wrote: > > > On 2018年01月08日 16:43, Nikolay Borisov wrote: >> This test has been failing for btrfs for quite some time, >> at least since 4.7. There are 2 implementation details of btrfs that >> it exposes: >> >> 1. Currently btrfs filesystem under 100mb are created in Mixed block >> group mode. Freespace accounting for it is not 100% accurate - I've >> observed around 100-200kb discrepancy between a newly created filesystem, >> then writing a file and deleting it and checking the free space. This >> falls within %3 and not %1 as hardcoded in the test. >> >> 2. BTRFS won't flush it's delayed allocation on file deletion if less >> than 32mb are deleted. On such files we need to perform sync (missing >> in the test) or wait until time elapses for transaction commit. > > I'm a little confused about the 32mb limit. > > My personal guess about the reason to delay space freeing would be: > 1) Performance > Btrfs tree operation (at least for write) is slow due to its tree > design. > So it may makes sense to delay space freeing. > > But in that case, 32MB may seems to small to really improve the > performance. (Max file extent size is 128M, delaying one item > deletion doesn't really improve performance) > > 2) To avoid later new allocation to rewrite the data. > It's possible that freed space of deleted inode A get allocated to > new file extents. And a power loss happens before we commit the > transaction. > > In that case, if everything else works fine, we should be reverted to > previous transaction where deleted inode A still exists. > But we lost its data, as its data is overwritten by other file > extents. And any read will just cause csum error. > > But in that case, there shouldn't be any 32MB limit, but all deletion > of orphan inodes should be delayed. > > And further more, this can be addressed using log tree, to log such > deletion so at recovery time, we just delete that inode. > > So I'm wonder if we can improve btrfs deletion behavior. So the initial commit I'm talking about is this one: 28ed1345a50491d78e1454ad4005dc5d3557a69e and the function in question is: btrfs_truncate_inode_items > > >> >> Since mixed mode is somewhat deprecated and btrfs is not really intended >> to be used on really small devices let's just adjust the test to >> create a 101mb fs, which doesn't use mixed mode and really test >> freespace accounting. > > Despite of some btrfs related questions, I'm wondering if there is any > standard specifying (POSIX?) how a filesystem should behave when > unlinking a file. > > Should the space freeing be synchronized? And how should statfs report > available space? While digging a bit more into this I also found that we alway have a discrepancy of 1mb when reporting statfs if we have unallocated device space and if everything is allocated to chunks. The reason being that if we have unallocated space on the device then btrfs_calc_avail_data_space subtracts 1mb from the unallocated space of the device. This is apparently done to mimics the allocator's behavior. However, when everything is reported based on the status of the space_info struct i.e btrfs_calc_avail_data_space return 0 in free_bytes then we don't do this subtraction. So in the end the issue might not really be in the test but in the way we report freespace from df. And 101mb happens to work because the 1mb discrepancy falls within %1 error margin on a fs of 101mb size > > In short, I'm wondering if this test and its expected behavior is > generic enough for all filesystems. > > Thanks, > Qu > >> >> Signed-off-by: Nikolay Borisov <nborisov@suse.com> >> --- >> tests/generic/015 | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/tests/generic/015 b/tests/generic/015 >> index 78f2b13..416c4ae 100755 >> --- a/tests/generic/015 >> +++ b/tests/generic/015 >> @@ -53,7 +53,7 @@ _supported_os Linux >> _require_scratch >> _require_no_large_scratch_dev >> >> -_scratch_mkfs_sized `expr 50 \* 1024 \* 1024` >/dev/null 2>&1 \ >> +_scratch_mkfs_sized `expr 101 \* 1024 \* 1024` >/dev/null 2>&1 \ >> || _fail "mkfs failed" >> _scratch_mount || _fail "mount failed" >> out=$SCRATCH_MNT/fillup.$$ >> > -- To unsubscribe from this list: send the line "unsubscribe fstests" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2018年01月08日 21:55, Timofey Titovets wrote: > 2018-01-08 15:54 GMT+03:00 Qu Wenruo <quwenruo.btrfs@gmx.com>: >> >> >> On 2018年01月08日 16:43, Nikolay Borisov wrote: >>> This test has been failing for btrfs for quite some time, >>> at least since 4.7. There are 2 implementation details of btrfs that >>> it exposes: >>> >>> 1. Currently btrfs filesystem under 100mb are created in Mixed block >>> group mode. Freespace accounting for it is not 100% accurate - I've >>> observed around 100-200kb discrepancy between a newly created filesystem, >>> then writing a file and deleting it and checking the free space. This >>> falls within %3 and not %1 as hardcoded in the test. >>> >>> 2. BTRFS won't flush it's delayed allocation on file deletion if less >>> than 32mb are deleted. On such files we need to perform sync (missing >>> in the test) or wait until time elapses for transaction commit. >> >> I'm a little confused about the 32mb limit. >> >> My personal guess about the reason to delay space freeing would be: >> 1) Performance >> Btrfs tree operation (at least for write) is slow due to its tree >> design. >> So it may makes sense to delay space freeing. >> >> But in that case, 32MB may seems to small to really improve the >> performance. (Max file extent size is 128M, delaying one item >> deletion doesn't really improve performance) >> >> 2) To avoid later new allocation to rewrite the data. >> It's possible that freed space of deleted inode A get allocated to >> new file extents. And a power loss happens before we commit the >> transaction. >> >> In that case, if everything else works fine, we should be reverted to >> previous transaction where deleted inode A still exists. >> But we lost its data, as its data is overwritten by other file >> extents. And any read will just cause csum error. >> >> But in that case, there shouldn't be any 32MB limit, but all deletion >> of orphan inodes should be delayed. >> >> And further more, this can be addressed using log tree, to log such >> deletion so at recovery time, we just delete that inode. >> >> So I'm wonder if we can improve btrfs deletion behavior. >> >> >>> >>> Since mixed mode is somewhat deprecated and btrfs is not really intended >>> to be used on really small devices let's just adjust the test to >>> create a 101mb fs, which doesn't use mixed mode and really test >>> freespace accounting. >> >> Despite of some btrfs related questions, I'm wondering if there is any >> standard specifying (POSIX?) how a filesystem should behave when >> unlinking a file. >> >> Should the space freeing be synchronized? And how should statfs report >> available space? >> >> In short, I'm wondering if this test and its expected behavior is >> generic enough for all filesystems. >> >> Thanks, >> Qu >> >>> >>> Signed-off-by: Nikolay Borisov <nborisov@suse.com> >>> --- >>> tests/generic/015 | 2 +- >>> 1 file changed, 1 insertion(+), 1 deletion(-) >>> >>> diff --git a/tests/generic/015 b/tests/generic/015 >>> index 78f2b13..416c4ae 100755 >>> --- a/tests/generic/015 >>> +++ b/tests/generic/015 >>> @@ -53,7 +53,7 @@ _supported_os Linux >>> _require_scratch >>> _require_no_large_scratch_dev >>> >>> -_scratch_mkfs_sized `expr 50 \* 1024 \* 1024` >/dev/null 2>&1 \ >>> +_scratch_mkfs_sized `expr 101 \* 1024 \* 1024` >/dev/null 2>&1 \ >>> || _fail "mkfs failed" >>> _scratch_mount || _fail "mount failed" >>> out=$SCRATCH_MNT/fillup.$$ >>> >> > > All fs, including btrfs (AFAIK) return unlink(), (if file not open) > only then space has been freed. AFAIK there is btrfs orphan inode item for delayed deletion. So at least not for btrfs. > So free space after return of unlink() must be freed. > Proofs: [1] [2] [3] > [4] - Posix, looks like do not describe behaviour. > > 1. http://man7.org/linux/man-pages/man2/unlink.2.html "If that name was the last link to a file and no processes have the file open, the file is deleted and the space it was using is made available for reuse." But when? I think that's the point, and no exact explain on this. Thanks, Qu > 2. https://stackoverflow.com/questions/31448693/why-system-call-unlink-so-slow > 3. https://www.spinics.net/lists/linux-btrfs/msg59901.html > 4. https://www.unix.com/man-page/posix/1P/unlink/ > > Thanks. >
On 8.01.2018 10:43, Nikolay Borisov wrote: > This test has been failing for btrfs for quite some time, > at least since 4.7. There are 2 implementation details of btrfs that > it exposes: > > 1. Currently btrfs filesystem under 100mb are created in Mixed block > group mode. Freespace accounting for it is not 100% accurate - I've > observed around 100-200kb discrepancy between a newly created filesystem, > then writing a file and deleting it and checking the free space. This > falls within %3 and not %1 as hardcoded in the test. > > 2. BTRFS won't flush it's delayed allocation on file deletion if less > than 32mb are deleted. On such files we need to perform sync (missing > in the test) or wait until time elapses for transaction commit. > > Since mixed mode is somewhat deprecated and btrfs is not really intended > to be used on really small devices let's just adjust the test to > create a 101mb fs, which doesn't use mixed mode and really test > freespace accounting. > > Signed-off-by: Nikolay Borisov <nborisov@suse.com> After further experimentation this test is indeed hitting a very pathological edge case. So when we create a btrfs file system in mixed mode with default raid levels the system chunk occupies 0-4mb. As such we can never allocate a chunk which spans this range so the -1mb in btrfs_calc_avail_data_space is not required in this case. Anyway, I believe this to be a really minor nuisance not worth complicating the code to fix, so I'd suggest this test change be merged. > --- > tests/generic/015 | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/tests/generic/015 b/tests/generic/015 > index 78f2b13..416c4ae 100755 > --- a/tests/generic/015 > +++ b/tests/generic/015 > @@ -53,7 +53,7 @@ _supported_os Linux > _require_scratch > _require_no_large_scratch_dev > > -_scratch_mkfs_sized `expr 50 \* 1024 \* 1024` >/dev/null 2>&1 \ > +_scratch_mkfs_sized `expr 101 \* 1024 \* 1024` >/dev/null 2>&1 \ > || _fail "mkfs failed" > _scratch_mount || _fail "mount failed" > out=$SCRATCH_MNT/fillup.$$ > -- To unsubscribe from this list: send the line "unsubscribe fstests" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jan 08, 2018 at 10:43:30AM +0200, Nikolay Borisov wrote: > This test has been failing for btrfs for quite some time, > at least since 4.7. There are 2 implementation details of btrfs that > it exposes: > > 1. Currently btrfs filesystem under 100mb are created in Mixed block > group mode. Freespace accounting for it is not 100% accurate - I've mkfs.btrfs does this too? Because I noticed _scratch_mkfs_sized adds '--mixed' mkfs option explicitly. > observed around 100-200kb discrepancy between a newly created filesystem, > then writing a file and deleting it and checking the free space. This > falls within %3 and not %1 as hardcoded in the test. > > 2. BTRFS won't flush it's delayed allocation on file deletion if less > than 32mb are deleted. On such files we need to perform sync (missing > in the test) or wait until time elapses for transaction commit. > > Since mixed mode is somewhat deprecated and btrfs is not really intended > to be used on really small devices let's just adjust the test to > create a 101mb fs, which doesn't use mixed mode and really test > freespace accounting. > > Signed-off-by: Nikolay Borisov <nborisov@suse.com> > --- > tests/generic/015 | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/tests/generic/015 b/tests/generic/015 > index 78f2b13..416c4ae 100755 > --- a/tests/generic/015 > +++ b/tests/generic/015 > @@ -53,7 +53,7 @@ _supported_os Linux > _require_scratch > _require_no_large_scratch_dev > > -_scratch_mkfs_sized `expr 50 \* 1024 \* 1024` >/dev/null 2>&1 \ > +_scratch_mkfs_sized `expr 101 \* 1024 \* 1024` >/dev/null 2>&1 \ Better to have some comments in the code too, to explain why we choose 101m filesystem size. Thanks, Eryu > || _fail "mkfs failed" > _scratch_mount || _fail "mount failed" > out=$SCRATCH_MNT/fillup.$$ > -- > 2.7.4 > > -- > To unsubscribe from this list: send the line "unsubscribe fstests" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe fstests" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/tests/generic/015 b/tests/generic/015 index 78f2b13..416c4ae 100755 --- a/tests/generic/015 +++ b/tests/generic/015 @@ -53,7 +53,7 @@ _supported_os Linux _require_scratch _require_no_large_scratch_dev -_scratch_mkfs_sized `expr 50 \* 1024 \* 1024` >/dev/null 2>&1 \ +_scratch_mkfs_sized `expr 101 \* 1024 \* 1024` >/dev/null 2>&1 \ || _fail "mkfs failed" _scratch_mount || _fail "mount failed" out=$SCRATCH_MNT/fillup.$$
This test has been failing for btrfs for quite some time, at least since 4.7. There are 2 implementation details of btrfs that it exposes: 1. Currently btrfs filesystem under 100mb are created in Mixed block group mode. Freespace accounting for it is not 100% accurate - I've observed around 100-200kb discrepancy between a newly created filesystem, then writing a file and deleting it and checking the free space. This falls within %3 and not %1 as hardcoded in the test. 2. BTRFS won't flush it's delayed allocation on file deletion if less than 32mb are deleted. On such files we need to perform sync (missing in the test) or wait until time elapses for transaction commit. Since mixed mode is somewhat deprecated and btrfs is not really intended to be used on really small devices let's just adjust the test to create a 101mb fs, which doesn't use mixed mode and really test freespace accounting. Signed-off-by: Nikolay Borisov <nborisov@suse.com> --- tests/generic/015 | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)