Message ID | 20191216182656.15624-2-fdmanana@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Allow deduplication of the eof block when it is safe to do so | expand |
On 12/16/19 1:26 PM, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > We always round down, to a multiple of the filesystem's block size, the > length to deduplicate at generic_remap_check_len(). However this is only > needed if an attempt to deduplicate the last block into the middle of the > destination file is requested, since that leads into a corruption if the > length of the source file is not block size aligned. When an attempt to > deduplicate the last block into the end of the destination file is > requested, we should allow it because it is safe to do it - there's no > stale data exposure and we are prepared to compare the data ranges for > a length not aligned to the block (or page) size - in fact we even do > the data compare before adjusting the deduplication length. > > After btrfs was updated to use the generic helpers from VFS (by commit > 34a28e3d77535e ("Btrfs: use generic_remap_file_range_prep() for cloning > and deduplication")) we started to have user reports of deduplication > not reflinking the last block anymore, and whence users getting lower > deduplication scores. The main use case is deduplication of entire > files that have a size not aligned to the block size of the filesystem. > > We already allow cloning the last block to the end (and beyond) of the > destination file, so allow for deduplication as well. > > Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/ > Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Thanks, Josef
On Mon, Dec 16, 2019 at 6:28 PM <fdmanana@kernel.org> wrote: > > From: Filipe Manana <fdmanana@suse.com> > > We always round down, to a multiple of the filesystem's block size, the > length to deduplicate at generic_remap_check_len(). However this is only > needed if an attempt to deduplicate the last block into the middle of the > destination file is requested, since that leads into a corruption if the > length of the source file is not block size aligned. When an attempt to > deduplicate the last block into the end of the destination file is > requested, we should allow it because it is safe to do it - there's no > stale data exposure and we are prepared to compare the data ranges for > a length not aligned to the block (or page) size - in fact we even do > the data compare before adjusting the deduplication length. > > After btrfs was updated to use the generic helpers from VFS (by commit > 34a28e3d77535e ("Btrfs: use generic_remap_file_range_prep() for cloning > and deduplication")) we started to have user reports of deduplication > not reflinking the last block anymore, and whence users getting lower > deduplication scores. The main use case is deduplication of entire > files that have a size not aligned to the block size of the filesystem. > > We already allow cloning the last block to the end (and beyond) of the > destination file, so allow for deduplication as well. > > Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/ > Signed-off-by: Filipe Manana <fdmanana@suse.com> Darrick, Al, any feedback? Thanks. > --- > fs/read_write.c | 10 ++++------ > 1 file changed, 4 insertions(+), 6 deletions(-) > > diff --git a/fs/read_write.c b/fs/read_write.c > index 5bbf587f5bc1..7458fccc59e1 100644 > --- a/fs/read_write.c > +++ b/fs/read_write.c > @@ -1777,10 +1777,9 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len, > * else. Assume that the offsets have already been checked for block > * alignment. > * > - * For deduplication we always scale down to the previous block because we > - * can't meaningfully compare post-EOF contents. > - * > - * For clone we only link a partial EOF block above the destination file's EOF. > + * For clone we only link a partial EOF block above or at the destination file's > + * EOF. For deduplication we accept a partial EOF block only if it ends at the > + * destination file's EOF (can not link it into the middle of a file). > * > * Shorten the request if possible. > */ > @@ -1796,8 +1795,7 @@ static int generic_remap_check_len(struct inode *inode_in, > if ((*len & blkmask) == 0) > return 0; > > - if ((remap_flags & REMAP_FILE_DEDUP) || > - pos_out + *len < i_size_read(inode_out)) > + if (pos_out + *len < i_size_read(inode_out)) > new_len &= ~blkmask; > > if (new_len == *len) > -- > 2.11.0 >
On Tue, Jan 07, 2020 at 04:23:15PM +0000, Filipe Manana wrote: > On Mon, Dec 16, 2019 at 6:28 PM <fdmanana@kernel.org> wrote: > > > > From: Filipe Manana <fdmanana@suse.com> > > > > We always round down, to a multiple of the filesystem's block size, the > > length to deduplicate at generic_remap_check_len(). However this is only > > needed if an attempt to deduplicate the last block into the middle of the > > destination file is requested, since that leads into a corruption if the > > length of the source file is not block size aligned. When an attempt to > > deduplicate the last block into the end of the destination file is > > requested, we should allow it because it is safe to do it - there's no > > stale data exposure and we are prepared to compare the data ranges for > > a length not aligned to the block (or page) size - in fact we even do > > the data compare before adjusting the deduplication length. > > > > After btrfs was updated to use the generic helpers from VFS (by commit > > 34a28e3d77535e ("Btrfs: use generic_remap_file_range_prep() for cloning > > and deduplication")) we started to have user reports of deduplication > > not reflinking the last block anymore, and whence users getting lower > > deduplication scores. The main use case is deduplication of entire > > files that have a size not aligned to the block size of the filesystem. > > > > We already allow cloning the last block to the end (and beyond) of the > > destination file, so allow for deduplication as well. > > > > Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/ > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > > Darrick, Al, any feedback? Is there a fstest to check for correct operation of dedupe at or beyond source and destfile EOF? Particularly if one range is /not/ at EOF? And that an mmap read of the EOF block will see zeroes past EOF before and after the dedupe operation? If I fallocate a 16k file, write 'X' into the first 5000 bytes, write 'X' into the first 66,440 bytes (60k + 5000) of a second file, and then try to dedupe (first file, 0-8k) with (second file, 60k-68k), should that work? I'm convinced that we could support dedupe to EOF when the ranges of the two files both end at the respective file's EOF, but it's the weirder corner cases that I worry about... --D > Thanks. > > > --- > > fs/read_write.c | 10 ++++------ > > 1 file changed, 4 insertions(+), 6 deletions(-) > > > > diff --git a/fs/read_write.c b/fs/read_write.c > > index 5bbf587f5bc1..7458fccc59e1 100644 > > --- a/fs/read_write.c > > +++ b/fs/read_write.c > > @@ -1777,10 +1777,9 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len, > > * else. Assume that the offsets have already been checked for block > > * alignment. > > * > > - * For deduplication we always scale down to the previous block because we > > - * can't meaningfully compare post-EOF contents. > > - * > > - * For clone we only link a partial EOF block above the destination file's EOF. > > + * For clone we only link a partial EOF block above or at the destination file's > > + * EOF. For deduplication we accept a partial EOF block only if it ends at the > > + * destination file's EOF (can not link it into the middle of a file). > > * > > * Shorten the request if possible. > > */ > > @@ -1796,8 +1795,7 @@ static int generic_remap_check_len(struct inode *inode_in, > > if ((*len & blkmask) == 0) > > return 0; > > > > - if ((remap_flags & REMAP_FILE_DEDUP) || > > - pos_out + *len < i_size_read(inode_out)) > > + if (pos_out + *len < i_size_read(inode_out)) > > new_len &= ~blkmask; > > > > if (new_len == *len) > > -- > > 2.11.0 > >
On Tue, Jan 7, 2020 at 5:57 PM Darrick J. Wong <darrick.wong@oracle.com> wrote: > > On Tue, Jan 07, 2020 at 04:23:15PM +0000, Filipe Manana wrote: > > On Mon, Dec 16, 2019 at 6:28 PM <fdmanana@kernel.org> wrote: > > > > > > From: Filipe Manana <fdmanana@suse.com> > > > > > > We always round down, to a multiple of the filesystem's block size, the > > > length to deduplicate at generic_remap_check_len(). However this is only > > > needed if an attempt to deduplicate the last block into the middle of the > > > destination file is requested, since that leads into a corruption if the > > > length of the source file is not block size aligned. When an attempt to > > > deduplicate the last block into the end of the destination file is > > > requested, we should allow it because it is safe to do it - there's no > > > stale data exposure and we are prepared to compare the data ranges for > > > a length not aligned to the block (or page) size - in fact we even do > > > the data compare before adjusting the deduplication length. > > > > > > After btrfs was updated to use the generic helpers from VFS (by commit > > > 34a28e3d77535e ("Btrfs: use generic_remap_file_range_prep() for cloning > > > and deduplication")) we started to have user reports of deduplication > > > not reflinking the last block anymore, and whence users getting lower > > > deduplication scores. The main use case is deduplication of entire > > > files that have a size not aligned to the block size of the filesystem. > > > > > > We already allow cloning the last block to the end (and beyond) of the > > > destination file, so allow for deduplication as well. > > > > > > Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/ > > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > > > > Darrick, Al, any feedback? > > Is there a fstest to check for correct operation of dedupe at or beyond > source and destfile EOF? Particularly if one range is /not/ at EOF? Such as what generic/158 does already? > And that an mmap read of the EOF block will see zeroes past EOF before > and after the dedupe operation? Can you elaborate a bit more? Why an mmap read and not a buffered or a direct IO read before and after deduplication? Is there anything special for the mmap reads on xfs, is that your concern? Or is the idea to deduplicate while the file is mmap'ed? > > If I fallocate a 16k file, write 'X' into the first 5000 bytes, > write 'X' into the first 66,440 bytes (60k + 5000) of a second file, and > then try to dedupe (first file, 0-8k) with (second file, 60k-68k), > should that work? You haven't mentioned the size of the second file, nor if the first file has a size of 16K which I assume (instead of fallocate with the keep size flag). Anyway, I assume you actually meant to dedupe the range 0 - 5000 from the first file into the range 60k - 60k + 5000 of the second file, and that the second file has a size of 60k + 5000. If so, that fails with -EINVAL because the source range is not block size aligned, and we already have generic fstests that test attempt to duplication and clone non-aligned ranges that don't end at eof. This patch doesn't change that behaviour, it only aims to allow deduplication of the eof block of the source file into the eof of the destination file. > > I'm convinced that we could support dedupe to EOF when the ranges of the > two files both end at the respective file's EOF, but it's the weirder > corner cases that I worry about... Well, we used to do that in btrfs before migrating to the generic code. Since I discovered the corruption due to deduplication of the eof block into the middle of a file in 2018's summer, the btrfs fix allowed deduplication of the eof block only if the destination end offset matched the eof of the destination file: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=de02b9f6bb65a6a1848f346f7a3617b7a9b930c0 Since then no issues were found nor users reported any problems so far. Any other specific test you would like to see? Thanks. > > --D > > > Thanks. > > > > > --- > > > fs/read_write.c | 10 ++++------ > > > 1 file changed, 4 insertions(+), 6 deletions(-) > > > > > > diff --git a/fs/read_write.c b/fs/read_write.c > > > index 5bbf587f5bc1..7458fccc59e1 100644 > > > --- a/fs/read_write.c > > > +++ b/fs/read_write.c > > > @@ -1777,10 +1777,9 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len, > > > * else. Assume that the offsets have already been checked for block > > > * alignment. > > > * > > > - * For deduplication we always scale down to the previous block because we > > > - * can't meaningfully compare post-EOF contents. > > > - * > > > - * For clone we only link a partial EOF block above the destination file's EOF. > > > + * For clone we only link a partial EOF block above or at the destination file's > > > + * EOF. For deduplication we accept a partial EOF block only if it ends at the > > > + * destination file's EOF (can not link it into the middle of a file). > > > * > > > * Shorten the request if possible. > > > */ > > > @@ -1796,8 +1795,7 @@ static int generic_remap_check_len(struct inode *inode_in, > > > if ((*len & blkmask) == 0) > > > return 0; > > > > > > - if ((remap_flags & REMAP_FILE_DEDUP) || > > > - pos_out + *len < i_size_read(inode_out)) > > > + if (pos_out + *len < i_size_read(inode_out)) > > > new_len &= ~blkmask; > > > > > > if (new_len == *len) > > > -- > > > 2.11.0 > > >
On Wed, Jan 08, 2020 at 11:36:04AM +0000, Filipe Manana wrote: > On Tue, Jan 7, 2020 at 5:57 PM Darrick J. Wong <darrick.wong@oracle.com> wrote: > > > > On Tue, Jan 07, 2020 at 04:23:15PM +0000, Filipe Manana wrote: > > > On Mon, Dec 16, 2019 at 6:28 PM <fdmanana@kernel.org> wrote: > > > > > > > > From: Filipe Manana <fdmanana@suse.com> > > > > > > > > We always round down, to a multiple of the filesystem's block size, the > > > > length to deduplicate at generic_remap_check_len(). However this is only > > > > needed if an attempt to deduplicate the last block into the middle of the > > > > destination file is requested, since that leads into a corruption if the > > > > length of the source file is not block size aligned. When an attempt to > > > > deduplicate the last block into the end of the destination file is > > > > requested, we should allow it because it is safe to do it - there's no > > > > stale data exposure and we are prepared to compare the data ranges for > > > > a length not aligned to the block (or page) size - in fact we even do > > > > the data compare before adjusting the deduplication length. > > > > > > > > After btrfs was updated to use the generic helpers from VFS (by commit > > > > 34a28e3d77535e ("Btrfs: use generic_remap_file_range_prep() for cloning > > > > and deduplication")) we started to have user reports of deduplication > > > > not reflinking the last block anymore, and whence users getting lower > > > > deduplication scores. The main use case is deduplication of entire > > > > files that have a size not aligned to the block size of the filesystem. > > > > > > > > We already allow cloning the last block to the end (and beyond) of the > > > > destination file, so allow for deduplication as well. > > > > > > > > Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/ > > > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > > > > > > Darrick, Al, any feedback? > > > > Is there a fstest to check for correct operation of dedupe at or beyond > > source and destfile EOF? Particularly if one range is /not/ at EOF? > > Such as what generic/158 does already? Urk, heh. :) > > And that an mmap read of the EOF block will see zeroes past EOF before > > and after the dedupe operation? > > Can you elaborate a bit more? Why an mmap read and not a buffered or a > direct IO read before and after deduplication? > Is there anything special for the mmap reads on xfs, is that your > concern? Or is the idea to deduplicate while the file is mmap'ed? I cite mmap reads past EOF specifically because unlike buffered/direct reads where the VFS will stop reading exactly at EOF, a memory mapping maps in an entire memory page, and the fs is supposed to ensure that the bytes past EOF are zeroed. Hm now that I look at g/158 it doesn't actually verify mmap reads. I looked around and can't really see anything that checks mmap reads before and after a dedupe operation at EOF. > > If I fallocate a 16k file, write 'X' into the first 5000 bytes, > > write 'X' into the first 66,440 bytes (60k + 5000) of a second file, and > > then try to dedupe (first file, 0-8k) with (second file, 60k-68k), > > should that work? > > You haven't mentioned the size of the second file, nor if the first > file has a size of 16K which I assume (instead of fallocate with the > keep size flag). Er, sorry, yes. The first file is 16,384 bytes long; the second file is 66,440 bytes. > Anyway, I assume you actually meant to dedupe the range 0 - 5000 from > the first file into the range 60k - 60k + 5000 of the second file, and > that the second file has a size of 60k + 5000. Nope, I meant to say to dedupe the range (offset: 0, length: 8192) from the first file into the second file (offset: 61440, length: 8192). The source range is entirely below EOF, and the dest range ends right at EOF in the second file. > If so, that fails with -EINVAL because the source range is not block > size aligned, and we already have generic fstests that test attempt to > duplication and clone non-aligned ranges that don't end at eof. > This patch doesn't change that behaviour, it only aims to allow > deduplication of the eof block of the source file into the eof of the > destination file. > > > > > > I'm convinced that we could support dedupe to EOF when the ranges of the > > two files both end at the respective file's EOF, but it's the weirder > > corner cases that I worry about... > > Well, we used to do that in btrfs before migrating to the generic code. > Since I discovered the corruption due to deduplication of the eof > block into the middle of a file in 2018's summer, the btrfs fix > allowed deduplication of the eof block only if the destination end > offset matched the eof of the destination file: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=de02b9f6bb65a6a1848f346f7a3617b7a9b930c0 > > Since then no issues were found nor users reported any problems so far. <nod> I'm ok with that one scenario, it's the "one range ends at eof, the other doesn't" case that I'm picking on. :) (Another way to shut me up would be to run generic/52[12] with TIME_FACTOR=1000 (i.e. 1 billion fsx ops) and see what comes exploding out. :)) > Any other specific test you would like to see? No, just that. And mmap reads. :) --D > Thanks. > > > > > --D > > > > > Thanks. > > > > > > > --- > > > > fs/read_write.c | 10 ++++------ > > > > 1 file changed, 4 insertions(+), 6 deletions(-) > > > > > > > > diff --git a/fs/read_write.c b/fs/read_write.c > > > > index 5bbf587f5bc1..7458fccc59e1 100644 > > > > --- a/fs/read_write.c > > > > +++ b/fs/read_write.c > > > > @@ -1777,10 +1777,9 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len, > > > > * else. Assume that the offsets have already been checked for block > > > > * alignment. > > > > * > > > > - * For deduplication we always scale down to the previous block because we > > > > - * can't meaningfully compare post-EOF contents. > > > > - * > > > > - * For clone we only link a partial EOF block above the destination file's EOF. > > > > + * For clone we only link a partial EOF block above or at the destination file's > > > > + * EOF. For deduplication we accept a partial EOF block only if it ends at the > > > > + * destination file's EOF (can not link it into the middle of a file). > > > > * > > > > * Shorten the request if possible. > > > > */ > > > > @@ -1796,8 +1795,7 @@ static int generic_remap_check_len(struct inode *inode_in, > > > > if ((*len & blkmask) == 0) > > > > return 0; > > > > > > > > - if ((remap_flags & REMAP_FILE_DEDUP) || > > > > - pos_out + *len < i_size_read(inode_out)) > > > > + if (pos_out + *len < i_size_read(inode_out)) > > > > new_len &= ~blkmask; > > > > > > > > if (new_len == *len) > > > > -- > > > > 2.11.0 > > > >
On Wed, Jan 8, 2020 at 4:15 PM Darrick J. Wong <darrick.wong@oracle.com> wrote: > > On Wed, Jan 08, 2020 at 11:36:04AM +0000, Filipe Manana wrote: > > On Tue, Jan 7, 2020 at 5:57 PM Darrick J. Wong <darrick.wong@oracle.com> wrote: > > > > > > On Tue, Jan 07, 2020 at 04:23:15PM +0000, Filipe Manana wrote: > > > > On Mon, Dec 16, 2019 at 6:28 PM <fdmanana@kernel.org> wrote: > > > > > > > > > > From: Filipe Manana <fdmanana@suse.com> > > > > > > > > > > We always round down, to a multiple of the filesystem's block size, the > > > > > length to deduplicate at generic_remap_check_len(). However this is only > > > > > needed if an attempt to deduplicate the last block into the middle of the > > > > > destination file is requested, since that leads into a corruption if the > > > > > length of the source file is not block size aligned. When an attempt to > > > > > deduplicate the last block into the end of the destination file is > > > > > requested, we should allow it because it is safe to do it - there's no > > > > > stale data exposure and we are prepared to compare the data ranges for > > > > > a length not aligned to the block (or page) size - in fact we even do > > > > > the data compare before adjusting the deduplication length. > > > > > > > > > > After btrfs was updated to use the generic helpers from VFS (by commit > > > > > 34a28e3d77535e ("Btrfs: use generic_remap_file_range_prep() for cloning > > > > > and deduplication")) we started to have user reports of deduplication > > > > > not reflinking the last block anymore, and whence users getting lower > > > > > deduplication scores. The main use case is deduplication of entire > > > > > files that have a size not aligned to the block size of the filesystem. > > > > > > > > > > We already allow cloning the last block to the end (and beyond) of the > > > > > destination file, so allow for deduplication as well. > > > > > > > > > > Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/ > > > > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > > > > > > > > Darrick, Al, any feedback? > > > > > > Is there a fstest to check for correct operation of dedupe at or beyond > > > source and destfile EOF? Particularly if one range is /not/ at EOF? > > > > Such as what generic/158 does already? > > Urk, heh. :) > > > > And that an mmap read of the EOF block will see zeroes past EOF before > > > and after the dedupe operation? > > > > Can you elaborate a bit more? Why an mmap read and not a buffered or a > > direct IO read before and after deduplication? > > Is there anything special for the mmap reads on xfs, is that your > > concern? Or is the idea to deduplicate while the file is mmap'ed? > > I cite mmap reads past EOF specifically because unlike buffered/direct > reads where the VFS will stop reading exactly at EOF, a memory mapping > maps in an entire memory page, and the fs is supposed to ensure that the > bytes past EOF are zeroed. Ok, on btrfs we zero out the rest of the page both in for all reads. So, it's not problem: $ file_size=$((64 * 1024 + 500)) $ xfs_io -f -c "pwrite -S 0xba 0 $file_size" foo $ xfs_io -f -c "pwrite -S 0xba 0 $file_size" bar $ sync $ xfs_io -c "mmap 0 68K" -c "mread -v 0 68K" -c "dedupe bar 0 0 $file_size" -c "mread -v 0 68K" foo Both mreads returns exactly the same, eof is still the same and all bytes after it have a value of 0. This is the same as it is for cloning. > > Hm now that I look at g/158 it doesn't actually verify mmap reads. I > looked around and can't really see anything that checks mmap reads > before and after a dedupe operation at EOF. > > > > If I fallocate a 16k file, write 'X' into the first 5000 bytes, > > > write 'X' into the first 66,440 bytes (60k + 5000) of a second file, and > > > then try to dedupe (first file, 0-8k) with (second file, 60k-68k), > > > should that work? > > > > You haven't mentioned the size of the second file, nor if the first > > file has a size of 16K which I assume (instead of fallocate with the > > keep size flag). > > Er, sorry, yes. The first file is 16,384 bytes long; the second file is > 66,440 bytes. > > > Anyway, I assume you actually meant to dedupe the range 0 - 5000 from > > the first file into the range 60k - 60k + 5000 of the second file, and > > that the second file has a size of 60k + 5000. > > Nope, I meant to say to dedupe the range (offset: 0, length: 8192) from > the first file into the second file (offset: 61440, length: 8192). The > source range is entirely below EOF, and the dest range ends right at > EOF in the second file. The dedupe operations fails with -EINVAL, both before and after this patch, since the whenever one of ranges crosses eof, we must fail dedupe with -EINVAL. This happens at generic_remap_checks(), which called before generic_remap_check_len(). This is no different from the other existing tests in fstests, which cover this case already. > > > If so, that fails with -EINVAL because the source range is not block > > size aligned, and we already have generic fstests that test attempt to > > duplication and clone non-aligned ranges that don't end at eof. > > This patch doesn't change that behaviour, it only aims to allow > > deduplication of the eof block of the source file into the eof of the > > destination file. > > > > > > > > > > I'm convinced that we could support dedupe to EOF when the ranges of the > > > two files both end at the respective file's EOF, but it's the weirder > > > corner cases that I worry about... > > > > Well, we used to do that in btrfs before migrating to the generic code. > > Since I discovered the corruption due to deduplication of the eof > > block into the middle of a file in 2018's summer, the btrfs fix > > allowed deduplication of the eof block only if the destination end > > offset matched the eof of the destination file: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=de02b9f6bb65a6a1848f346f7a3617b7a9b930c0 > > > > Since then no issues were found nor users reported any problems so far. > > <nod> I'm ok with that one scenario, it's the "one range ends at eof, > the other doesn't" case that I'm picking on. :) > > (Another way to shut me up would be to run generic/52[12] with > TIME_FACTOR=1000 (i.e. 1 billion fsx ops) and see what comes exploding > out. :)) Well, it will take nearly 2 weeks to get those tests to complete with 1 billion ops, taking into account each takes around 500 seconds to run here (on xfs) with 1 million operations... > > > Any other specific test you would like to see? > > No, just that. And mmap reads. :) Given that we allow cloning of eof into eof already (and that doesn't cause any known problems), I don't see why you are so concerned with allowing dedupe to behave the same. The only thing I can see it could be a problem would be comparing the contents of the last page beyond the eof position, but as stated on the changelog, it's not a problem since we call vfs_dedupe_file_range_compare() before we round down the length at generic_remap_check_len() - the data compare function already has the correct behaviour. Thanks. > > --D > > > Thanks. > > > > > > > > --D > > > > > > > Thanks. > > > > > > > > > --- > > > > > fs/read_write.c | 10 ++++------ > > > > > 1 file changed, 4 insertions(+), 6 deletions(-) > > > > > > > > > > diff --git a/fs/read_write.c b/fs/read_write.c > > > > > index 5bbf587f5bc1..7458fccc59e1 100644 > > > > > --- a/fs/read_write.c > > > > > +++ b/fs/read_write.c > > > > > @@ -1777,10 +1777,9 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len, > > > > > * else. Assume that the offsets have already been checked for block > > > > > * alignment. > > > > > * > > > > > - * For deduplication we always scale down to the previous block because we > > > > > - * can't meaningfully compare post-EOF contents. > > > > > - * > > > > > - * For clone we only link a partial EOF block above the destination file's EOF. > > > > > + * For clone we only link a partial EOF block above or at the destination file's > > > > > + * EOF. For deduplication we accept a partial EOF block only if it ends at the > > > > > + * destination file's EOF (can not link it into the middle of a file). > > > > > * > > > > > * Shorten the request if possible. > > > > > */ > > > > > @@ -1796,8 +1795,7 @@ static int generic_remap_check_len(struct inode *inode_in, > > > > > if ((*len & blkmask) == 0) > > > > > return 0; > > > > > > > > > > - if ((remap_flags & REMAP_FILE_DEDUP) || > > > > > - pos_out + *len < i_size_read(inode_out)) > > > > > + if (pos_out + *len < i_size_read(inode_out)) > > > > > new_len &= ~blkmask; > > > > > > > > > > if (new_len == *len) > > > > > -- > > > > > 2.11.0 > > > > >
On Thu, Jan 09, 2020 at 07:00:09PM +0000, Filipe Manana wrote: > On Wed, Jan 8, 2020 at 4:15 PM Darrick J. Wong <darrick.wong@oracle.com> wrote: > > > > On Wed, Jan 08, 2020 at 11:36:04AM +0000, Filipe Manana wrote: > > > On Tue, Jan 7, 2020 at 5:57 PM Darrick J. Wong <darrick.wong@oracle.com> wrote: > > > > > > > > On Tue, Jan 07, 2020 at 04:23:15PM +0000, Filipe Manana wrote: > > > > > On Mon, Dec 16, 2019 at 6:28 PM <fdmanana@kernel.org> wrote: > > > > > > > > > > > > From: Filipe Manana <fdmanana@suse.com> > > > > > > > > > > > > We always round down, to a multiple of the filesystem's block size, the > > > > > > length to deduplicate at generic_remap_check_len(). However this is only > > > > > > needed if an attempt to deduplicate the last block into the middle of the > > > > > > destination file is requested, since that leads into a corruption if the > > > > > > length of the source file is not block size aligned. When an attempt to > > > > > > deduplicate the last block into the end of the destination file is > > > > > > requested, we should allow it because it is safe to do it - there's no > > > > > > stale data exposure and we are prepared to compare the data ranges for > > > > > > a length not aligned to the block (or page) size - in fact we even do > > > > > > the data compare before adjusting the deduplication length. > > > > > > > > > > > > After btrfs was updated to use the generic helpers from VFS (by commit > > > > > > 34a28e3d77535e ("Btrfs: use generic_remap_file_range_prep() for cloning > > > > > > and deduplication")) we started to have user reports of deduplication > > > > > > not reflinking the last block anymore, and whence users getting lower > > > > > > deduplication scores. The main use case is deduplication of entire > > > > > > files that have a size not aligned to the block size of the filesystem. > > > > > > > > > > > > We already allow cloning the last block to the end (and beyond) of the > > > > > > destination file, so allow for deduplication as well. > > > > > > > > > > > > Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/ > > > > > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > > > > > > > > > > Darrick, Al, any feedback? > > > > > > > > Is there a fstest to check for correct operation of dedupe at or beyond > > > > source and destfile EOF? Particularly if one range is /not/ at EOF? > > > > > > Such as what generic/158 does already? > > > > Urk, heh. :) > > > > > > And that an mmap read of the EOF block will see zeroes past EOF before > > > > and after the dedupe operation? > > > > > > Can you elaborate a bit more? Why an mmap read and not a buffered or a > > > direct IO read before and after deduplication? > > > Is there anything special for the mmap reads on xfs, is that your > > > concern? Or is the idea to deduplicate while the file is mmap'ed? > > > > I cite mmap reads past EOF specifically because unlike buffered/direct > > reads where the VFS will stop reading exactly at EOF, a memory mapping > > maps in an entire memory page, and the fs is supposed to ensure that the > > bytes past EOF are zeroed. > > Ok, on btrfs we zero out the rest of the page both in for all reads. > > So, it's not problem: > > $ file_size=$((64 * 1024 + 500)) > $ xfs_io -f -c "pwrite -S 0xba 0 $file_size" foo > $ xfs_io -f -c "pwrite -S 0xba 0 $file_size" bar > $ sync > > $ xfs_io -c "mmap 0 68K" -c "mread -v 0 68K" -c "dedupe bar 0 0 > $file_size" -c "mread -v 0 68K" foo > > Both mreads returns exactly the same, eof is still the same and all > bytes after it have a value of 0. > > This is the same as it is for cloning. Oh good. :) > > > > Hm now that I look at g/158 it doesn't actually verify mmap reads. I > > looked around and can't really see anything that checks mmap reads > > before and after a dedupe operation at EOF. > > > > > > If I fallocate a 16k file, write 'X' into the first 5000 bytes, > > > > write 'X' into the first 66,440 bytes (60k + 5000) of a second file, and > > > > then try to dedupe (first file, 0-8k) with (second file, 60k-68k), > > > > should that work? > > > > > > You haven't mentioned the size of the second file, nor if the first > > > file has a size of 16K which I assume (instead of fallocate with the > > > keep size flag). > > > > Er, sorry, yes. The first file is 16,384 bytes long; the second file is > > 66,440 bytes. > > > > > Anyway, I assume you actually meant to dedupe the range 0 - 5000 from > > > the first file into the range 60k - 60k + 5000 of the second file, and > > > that the second file has a size of 60k + 5000. > > > > Nope, I meant to say to dedupe the range (offset: 0, length: 8192) from > > the first file into the second file (offset: 61440, length: 8192). The > > source range is entirely below EOF, and the dest range ends right at > > EOF in the second file. > > The dedupe operations fails with -EINVAL, both before and after this patch, > since the whenever one of ranges crosses eof, we must fail dedupe with -EINVAL. > This happens at generic_remap_checks(), which called before > generic_remap_check_len(). > > This is no different from the other existing tests in fstests, which > cover this case already. Ah, ok. > > > If so, that fails with -EINVAL because the source range is not block > > > size aligned, and we already have generic fstests that test attempt to > > > duplication and clone non-aligned ranges that don't end at eof. > > > This patch doesn't change that behaviour, it only aims to allow > > > deduplication of the eof block of the source file into the eof of the > > > destination file. > > > > > > > > > > > > > > I'm convinced that we could support dedupe to EOF when the ranges of the > > > > two files both end at the respective file's EOF, but it's the weirder > > > > corner cases that I worry about... > > > > > > Well, we used to do that in btrfs before migrating to the generic code. > > > Since I discovered the corruption due to deduplication of the eof > > > block into the middle of a file in 2018's summer, the btrfs fix > > > allowed deduplication of the eof block only if the destination end > > > offset matched the eof of the destination file: > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=de02b9f6bb65a6a1848f346f7a3617b7a9b930c0 > > > > > > Since then no issues were found nor users reported any problems so far. > > > > <nod> I'm ok with that one scenario, it's the "one range ends at eof, > > the other doesn't" case that I'm picking on. :) > > > > (Another way to shut me up would be to run generic/52[12] with > > TIME_FACTOR=1000 (i.e. 1 billion fsx ops) and see what comes exploding > > out. :)) > > Well, it will take nearly 2 weeks to get those tests to complete with > 1 billion ops, taking into > account each takes around 500 seconds to run here (on xfs) with 1 > million operations... I know. I probably should have clarified: every month or two I build a kernel from tip, throw it in a VM, and let it run for however long it goes until it dies, fixing whatever problems as I find them. *So far* no billionaire plutocrats have complained about the huge compute bills that are probably racking up. ;) (FWIW I also tossed your patch onto another one of these VMs and it hasn't died yet, so that's probably an ok sign) > > > > > Any other specific test you would like to see? > > > > No, just that. And mmap reads. :) > > Given that we allow cloning of eof into eof already (and that doesn't > cause any known problems), > I don't see why you are so concerned with allowing dedupe to behave the same. > > The only thing I can see it could be a problem would be comparing the > contents of the last page beyond the eof position, > but as stated on the changelog, it's not a problem since we call > vfs_dedupe_file_range_compare() before we > round down the length at generic_remap_check_len() - the data compare > function already has the correct behaviour. <nod> --D > Thanks. > > > > > --D > > > > > Thanks. > > > > > > > > > > > --D > > > > > > > > > Thanks. > > > > > > > > > > > --- > > > > > > fs/read_write.c | 10 ++++------ > > > > > > 1 file changed, 4 insertions(+), 6 deletions(-) > > > > > > > > > > > > diff --git a/fs/read_write.c b/fs/read_write.c > > > > > > index 5bbf587f5bc1..7458fccc59e1 100644 > > > > > > --- a/fs/read_write.c > > > > > > +++ b/fs/read_write.c > > > > > > @@ -1777,10 +1777,9 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len, > > > > > > * else. Assume that the offsets have already been checked for block > > > > > > * alignment. > > > > > > * > > > > > > - * For deduplication we always scale down to the previous block because we > > > > > > - * can't meaningfully compare post-EOF contents. > > > > > > - * > > > > > > - * For clone we only link a partial EOF block above the destination file's EOF. > > > > > > + * For clone we only link a partial EOF block above or at the destination file's > > > > > > + * EOF. For deduplication we accept a partial EOF block only if it ends at the > > > > > > + * destination file's EOF (can not link it into the middle of a file). > > > > > > * > > > > > > * Shorten the request if possible. > > > > > > */ > > > > > > @@ -1796,8 +1795,7 @@ static int generic_remap_check_len(struct inode *inode_in, > > > > > > if ((*len & blkmask) == 0) > > > > > > return 0; > > > > > > > > > > > > - if ((remap_flags & REMAP_FILE_DEDUP) || > > > > > > - pos_out + *len < i_size_read(inode_out)) > > > > > > + if (pos_out + *len < i_size_read(inode_out)) > > > > > > new_len &= ~blkmask; > > > > > > > > > > > > if (new_len == *len) > > > > > > -- > > > > > > 2.11.0 > > > > > >
On Thu, Jan 9, 2020 at 7:12 PM Darrick J. Wong <darrick.wong@oracle.com> wrote: > > On Thu, Jan 09, 2020 at 07:00:09PM +0000, Filipe Manana wrote: > > On Wed, Jan 8, 2020 at 4:15 PM Darrick J. Wong <darrick.wong@oracle.com> wrote: > > > > > > On Wed, Jan 08, 2020 at 11:36:04AM +0000, Filipe Manana wrote: > > > > On Tue, Jan 7, 2020 at 5:57 PM Darrick J. Wong <darrick.wong@oracle.com> wrote: > > > > > > > > > > On Tue, Jan 07, 2020 at 04:23:15PM +0000, Filipe Manana wrote: > > > > > > On Mon, Dec 16, 2019 at 6:28 PM <fdmanana@kernel.org> wrote: > > > > > > > > > > > > > > From: Filipe Manana <fdmanana@suse.com> > > > > > > > > > > > > > > We always round down, to a multiple of the filesystem's block size, the > > > > > > > length to deduplicate at generic_remap_check_len(). However this is only > > > > > > > needed if an attempt to deduplicate the last block into the middle of the > > > > > > > destination file is requested, since that leads into a corruption if the > > > > > > > length of the source file is not block size aligned. When an attempt to > > > > > > > deduplicate the last block into the end of the destination file is > > > > > > > requested, we should allow it because it is safe to do it - there's no > > > > > > > stale data exposure and we are prepared to compare the data ranges for > > > > > > > a length not aligned to the block (or page) size - in fact we even do > > > > > > > the data compare before adjusting the deduplication length. > > > > > > > > > > > > > > After btrfs was updated to use the generic helpers from VFS (by commit > > > > > > > 34a28e3d77535e ("Btrfs: use generic_remap_file_range_prep() for cloning > > > > > > > and deduplication")) we started to have user reports of deduplication > > > > > > > not reflinking the last block anymore, and whence users getting lower > > > > > > > deduplication scores. The main use case is deduplication of entire > > > > > > > files that have a size not aligned to the block size of the filesystem. > > > > > > > > > > > > > > We already allow cloning the last block to the end (and beyond) of the > > > > > > > destination file, so allow for deduplication as well. > > > > > > > > > > > > > > Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/ > > > > > > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > > > > > > > > > > > > Darrick, Al, any feedback? > > > > > > > > > > Is there a fstest to check for correct operation of dedupe at or beyond > > > > > source and destfile EOF? Particularly if one range is /not/ at EOF? > > > > > > > > Such as what generic/158 does already? > > > > > > Urk, heh. :) > > > > > > > > And that an mmap read of the EOF block will see zeroes past EOF before > > > > > and after the dedupe operation? > > > > > > > > Can you elaborate a bit more? Why an mmap read and not a buffered or a > > > > direct IO read before and after deduplication? > > > > Is there anything special for the mmap reads on xfs, is that your > > > > concern? Or is the idea to deduplicate while the file is mmap'ed? > > > > > > I cite mmap reads past EOF specifically because unlike buffered/direct > > > reads where the VFS will stop reading exactly at EOF, a memory mapping > > > maps in an entire memory page, and the fs is supposed to ensure that the > > > bytes past EOF are zeroed. > > > > Ok, on btrfs we zero out the rest of the page both in for all reads. > > > > So, it's not problem: > > > > $ file_size=$((64 * 1024 + 500)) > > $ xfs_io -f -c "pwrite -S 0xba 0 $file_size" foo > > $ xfs_io -f -c "pwrite -S 0xba 0 $file_size" bar > > $ sync > > > > $ xfs_io -c "mmap 0 68K" -c "mread -v 0 68K" -c "dedupe bar 0 0 > > $file_size" -c "mread -v 0 68K" foo > > > > Both mreads returns exactly the same, eof is still the same and all > > bytes after it have a value of 0. > > > > This is the same as it is for cloning. > > Oh good. :) > > > > > > > Hm now that I look at g/158 it doesn't actually verify mmap reads. I > > > looked around and can't really see anything that checks mmap reads > > > before and after a dedupe operation at EOF. > > > > > > > > If I fallocate a 16k file, write 'X' into the first 5000 bytes, > > > > > write 'X' into the first 66,440 bytes (60k + 5000) of a second file, and > > > > > then try to dedupe (first file, 0-8k) with (second file, 60k-68k), > > > > > should that work? > > > > > > > > You haven't mentioned the size of the second file, nor if the first > > > > file has a size of 16K which I assume (instead of fallocate with the > > > > keep size flag). > > > > > > Er, sorry, yes. The first file is 16,384 bytes long; the second file is > > > 66,440 bytes. > > > > > > > Anyway, I assume you actually meant to dedupe the range 0 - 5000 from > > > > the first file into the range 60k - 60k + 5000 of the second file, and > > > > that the second file has a size of 60k + 5000. > > > > > > Nope, I meant to say to dedupe the range (offset: 0, length: 8192) from > > > the first file into the second file (offset: 61440, length: 8192). The > > > source range is entirely below EOF, and the dest range ends right at > > > EOF in the second file. > > > > The dedupe operations fails with -EINVAL, both before and after this patch, > > since the whenever one of ranges crosses eof, we must fail dedupe with -EINVAL. > > This happens at generic_remap_checks(), which called before > > generic_remap_check_len(). > > > > This is no different from the other existing tests in fstests, which > > cover this case already. > > Ah, ok. > > > > > If so, that fails with -EINVAL because the source range is not block > > > > size aligned, and we already have generic fstests that test attempt to > > > > duplication and clone non-aligned ranges that don't end at eof. > > > > This patch doesn't change that behaviour, it only aims to allow > > > > deduplication of the eof block of the source file into the eof of the > > > > destination file. > > > > > > > > > > > > > > > > > > I'm convinced that we could support dedupe to EOF when the ranges of the > > > > > two files both end at the respective file's EOF, but it's the weirder > > > > > corner cases that I worry about... > > > > > > > > Well, we used to do that in btrfs before migrating to the generic code. > > > > Since I discovered the corruption due to deduplication of the eof > > > > block into the middle of a file in 2018's summer, the btrfs fix > > > > allowed deduplication of the eof block only if the destination end > > > > offset matched the eof of the destination file: > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=de02b9f6bb65a6a1848f346f7a3617b7a9b930c0 > > > > > > > > Since then no issues were found nor users reported any problems so far. > > > > > > <nod> I'm ok with that one scenario, it's the "one range ends at eof, > > > the other doesn't" case that I'm picking on. :) > > > > > > (Another way to shut me up would be to run generic/52[12] with > > > TIME_FACTOR=1000 (i.e. 1 billion fsx ops) and see what comes exploding > > > out. :)) > > > > Well, it will take nearly 2 weeks to get those tests to complete with > > 1 billion ops, taking into > > account each takes around 500 seconds to run here (on xfs) with 1 > > million operations... > > I know. I probably should have clarified: every month or two I build a > kernel from tip, throw it in a VM, and let it run for however long it > goes until it dies, fixing whatever problems as I find them. > > *So far* no billionaire plutocrats have complained about the huge > compute bills that are probably racking up. ;) > > (FWIW I also tossed your patch onto another one of these VMs and it > hasn't died yet, so that's probably an ok sign) For 100 million ops, on 5.5-rc5 kernel with this patchset and on xfs: $ TIME_FACTOR=100 MKFS_OPTIONS="-m reflink=1,rmapbt=1" ./check generic/521 generic/522 FSTYP -- xfs (debug) PLATFORM -- Linux/x86_64 debian5 5.5.0-rc5-btrfs-next-67 #1 SMP PREEMPT Thu Jan 9 13:48:07 WET 2020 MKFS_OPTIONS -- -f -m reflink=1,rmapbt=1 /dev/sdc MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1 generic/521 482s ... 48961s generic/522 536s ... 54432s Ran: generic/521 generic/522 Passed all 2 tests Similar for btrfs, no problems found (yet at least). Thanks. > > > > > > > > Any other specific test you would like to see? > > > > > > No, just that. And mmap reads. :) > > > > Given that we allow cloning of eof into eof already (and that doesn't > > cause any known problems), > > I don't see why you are so concerned with allowing dedupe to behave the same. > > > > The only thing I can see it could be a problem would be comparing the > > contents of the last page beyond the eof position, > > but as stated on the changelog, it's not a problem since we call > > vfs_dedupe_file_range_compare() before we > > round down the length at generic_remap_check_len() - the data compare > > function already has the correct behaviour. > > <nod> > > --D > > > Thanks. > > > > > > > > --D > > > > > > > Thanks. > > > > > > > > > > > > > > --D > > > > > > > > > > > Thanks. > > > > > > > > > > > > > --- > > > > > > > fs/read_write.c | 10 ++++------ > > > > > > > 1 file changed, 4 insertions(+), 6 deletions(-) > > > > > > > > > > > > > > diff --git a/fs/read_write.c b/fs/read_write.c > > > > > > > index 5bbf587f5bc1..7458fccc59e1 100644 > > > > > > > --- a/fs/read_write.c > > > > > > > +++ b/fs/read_write.c > > > > > > > @@ -1777,10 +1777,9 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len, > > > > > > > * else. Assume that the offsets have already been checked for block > > > > > > > * alignment. > > > > > > > * > > > > > > > - * For deduplication we always scale down to the previous block because we > > > > > > > - * can't meaningfully compare post-EOF contents. > > > > > > > - * > > > > > > > - * For clone we only link a partial EOF block above the destination file's EOF. > > > > > > > + * For clone we only link a partial EOF block above or at the destination file's > > > > > > > + * EOF. For deduplication we accept a partial EOF block only if it ends at the > > > > > > > + * destination file's EOF (can not link it into the middle of a file). > > > > > > > * > > > > > > > * Shorten the request if possible. > > > > > > > */ > > > > > > > @@ -1796,8 +1795,7 @@ static int generic_remap_check_len(struct inode *inode_in, > > > > > > > if ((*len & blkmask) == 0) > > > > > > > return 0; > > > > > > > > > > > > > > - if ((remap_flags & REMAP_FILE_DEDUP) || > > > > > > > - pos_out + *len < i_size_read(inode_out)) > > > > > > > + if (pos_out + *len < i_size_read(inode_out)) > > > > > > > new_len &= ~blkmask; > > > > > > > > > > > > > > if (new_len == *len) > > > > > > > -- > > > > > > > 2.11.0 > > > > > > >
On Tue, Jan 14, 2020 at 02:36:09PM +0000, Filipe Manana wrote: > On Thu, Jan 9, 2020 at 7:12 PM Darrick J. Wong <darrick.wong@oracle.com> wrote: > > > > On Thu, Jan 09, 2020 at 07:00:09PM +0000, Filipe Manana wrote: > > > On Wed, Jan 8, 2020 at 4:15 PM Darrick J. Wong <darrick.wong@oracle.com> wrote: > > > > > > > > On Wed, Jan 08, 2020 at 11:36:04AM +0000, Filipe Manana wrote: > > > > > On Tue, Jan 7, 2020 at 5:57 PM Darrick J. Wong <darrick.wong@oracle.com> wrote: > > > > > > > > > > > > On Tue, Jan 07, 2020 at 04:23:15PM +0000, Filipe Manana wrote: > > > > > > > On Mon, Dec 16, 2019 at 6:28 PM <fdmanana@kernel.org> wrote: > > > > > > > > > > > > > > > > From: Filipe Manana <fdmanana@suse.com> > > > > > > > > > > > > > > > > We always round down, to a multiple of the filesystem's block size, the > > > > > > > > length to deduplicate at generic_remap_check_len(). However this is only > > > > > > > > needed if an attempt to deduplicate the last block into the middle of the > > > > > > > > destination file is requested, since that leads into a corruption if the > > > > > > > > length of the source file is not block size aligned. When an attempt to > > > > > > > > deduplicate the last block into the end of the destination file is > > > > > > > > requested, we should allow it because it is safe to do it - there's no > > > > > > > > stale data exposure and we are prepared to compare the data ranges for > > > > > > > > a length not aligned to the block (or page) size - in fact we even do > > > > > > > > the data compare before adjusting the deduplication length. > > > > > > > > > > > > > > > > After btrfs was updated to use the generic helpers from VFS (by commit > > > > > > > > 34a28e3d77535e ("Btrfs: use generic_remap_file_range_prep() for cloning > > > > > > > > and deduplication")) we started to have user reports of deduplication > > > > > > > > not reflinking the last block anymore, and whence users getting lower > > > > > > > > deduplication scores. The main use case is deduplication of entire > > > > > > > > files that have a size not aligned to the block size of the filesystem. > > > > > > > > > > > > > > > > We already allow cloning the last block to the end (and beyond) of the > > > > > > > > destination file, so allow for deduplication as well. > > > > > > > > > > > > > > > > Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/ > > > > > > > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > > > > > > > > > > > > > > Darrick, Al, any feedback? > > > > > > > > > > > > Is there a fstest to check for correct operation of dedupe at or beyond > > > > > > source and destfile EOF? Particularly if one range is /not/ at EOF? > > > > > > > > > > Such as what generic/158 does already? > > > > > > > > Urk, heh. :) > > > > > > > > > > And that an mmap read of the EOF block will see zeroes past EOF before > > > > > > and after the dedupe operation? > > > > > > > > > > Can you elaborate a bit more? Why an mmap read and not a buffered or a > > > > > direct IO read before and after deduplication? > > > > > Is there anything special for the mmap reads on xfs, is that your > > > > > concern? Or is the idea to deduplicate while the file is mmap'ed? > > > > > > > > I cite mmap reads past EOF specifically because unlike buffered/direct > > > > reads where the VFS will stop reading exactly at EOF, a memory mapping > > > > maps in an entire memory page, and the fs is supposed to ensure that the > > > > bytes past EOF are zeroed. > > > > > > Ok, on btrfs we zero out the rest of the page both in for all reads. > > > > > > So, it's not problem: > > > > > > $ file_size=$((64 * 1024 + 500)) > > > $ xfs_io -f -c "pwrite -S 0xba 0 $file_size" foo > > > $ xfs_io -f -c "pwrite -S 0xba 0 $file_size" bar > > > $ sync > > > > > > $ xfs_io -c "mmap 0 68K" -c "mread -v 0 68K" -c "dedupe bar 0 0 > > > $file_size" -c "mread -v 0 68K" foo > > > > > > Both mreads returns exactly the same, eof is still the same and all > > > bytes after it have a value of 0. > > > > > > This is the same as it is for cloning. > > > > Oh good. :) > > > > > > > > > > Hm now that I look at g/158 it doesn't actually verify mmap reads. I > > > > looked around and can't really see anything that checks mmap reads > > > > before and after a dedupe operation at EOF. > > > > > > > > > > If I fallocate a 16k file, write 'X' into the first 5000 bytes, > > > > > > write 'X' into the first 66,440 bytes (60k + 5000) of a second file, and > > > > > > then try to dedupe (first file, 0-8k) with (second file, 60k-68k), > > > > > > should that work? > > > > > > > > > > You haven't mentioned the size of the second file, nor if the first > > > > > file has a size of 16K which I assume (instead of fallocate with the > > > > > keep size flag). > > > > > > > > Er, sorry, yes. The first file is 16,384 bytes long; the second file is > > > > 66,440 bytes. > > > > > > > > > Anyway, I assume you actually meant to dedupe the range 0 - 5000 from > > > > > the first file into the range 60k - 60k + 5000 of the second file, and > > > > > that the second file has a size of 60k + 5000. > > > > > > > > Nope, I meant to say to dedupe the range (offset: 0, length: 8192) from > > > > the first file into the second file (offset: 61440, length: 8192). The > > > > source range is entirely below EOF, and the dest range ends right at > > > > EOF in the second file. > > > > > > The dedupe operations fails with -EINVAL, both before and after this patch, > > > since the whenever one of ranges crosses eof, we must fail dedupe with -EINVAL. > > > This happens at generic_remap_checks(), which called before > > > generic_remap_check_len(). > > > > > > This is no different from the other existing tests in fstests, which > > > cover this case already. > > > > Ah, ok. > > > > > > > If so, that fails with -EINVAL because the source range is not block > > > > > size aligned, and we already have generic fstests that test attempt to > > > > > duplication and clone non-aligned ranges that don't end at eof. > > > > > This patch doesn't change that behaviour, it only aims to allow > > > > > deduplication of the eof block of the source file into the eof of the > > > > > destination file. > > > > > > > > > > > > > > > > > > > > > > I'm convinced that we could support dedupe to EOF when the ranges of the > > > > > > two files both end at the respective file's EOF, but it's the weirder > > > > > > corner cases that I worry about... > > > > > > > > > > Well, we used to do that in btrfs before migrating to the generic code. > > > > > Since I discovered the corruption due to deduplication of the eof > > > > > block into the middle of a file in 2018's summer, the btrfs fix > > > > > allowed deduplication of the eof block only if the destination end > > > > > offset matched the eof of the destination file: > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=de02b9f6bb65a6a1848f346f7a3617b7a9b930c0 > > > > > > > > > > Since then no issues were found nor users reported any problems so far. > > > > > > > > <nod> I'm ok with that one scenario, it's the "one range ends at eof, > > > > the other doesn't" case that I'm picking on. :) > > > > > > > > (Another way to shut me up would be to run generic/52[12] with > > > > TIME_FACTOR=1000 (i.e. 1 billion fsx ops) and see what comes exploding > > > > out. :)) > > > > > > Well, it will take nearly 2 weeks to get those tests to complete with > > > 1 billion ops, taking into > > > account each takes around 500 seconds to run here (on xfs) with 1 > > > million operations... > > > > I know. I probably should have clarified: every month or two I build a > > kernel from tip, throw it in a VM, and let it run for however long it > > goes until it dies, fixing whatever problems as I find them. > > > > *So far* no billionaire plutocrats have complained about the huge > > compute bills that are probably racking up. ;) > > > > (FWIW I also tossed your patch onto another one of these VMs and it > > hasn't died yet, so that's probably an ok sign) > > For 100 million ops, on 5.5-rc5 kernel with this patchset and on xfs: > > $ TIME_FACTOR=100 MKFS_OPTIONS="-m reflink=1,rmapbt=1" ./check > generic/521 generic/522 > FSTYP -- xfs (debug) > PLATFORM -- Linux/x86_64 debian5 5.5.0-rc5-btrfs-next-67 #1 SMP > PREEMPT Thu Jan 9 13:48:07 WET 2020 > MKFS_OPTIONS -- -f -m reflink=1,rmapbt=1 /dev/sdc > MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1 > > generic/521 482s ... 48961s > generic/522 536s ... 54432s > Ran: generic/521 generic/522 > Passed all 2 tests > > Similar for btrfs, no problems found (yet at least). > > Thanks. > Urk, I never reviewed this, did I... Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> --D > > > > > > > > > > > Any other specific test you would like to see? > > > > > > > > No, just that. And mmap reads. :) > > > > > > Given that we allow cloning of eof into eof already (and that doesn't > > > cause any known problems), > > > I don't see why you are so concerned with allowing dedupe to behave the same. > > > > > > The only thing I can see it could be a problem would be comparing the > > > contents of the last page beyond the eof position, > > > but as stated on the changelog, it's not a problem since we call > > > vfs_dedupe_file_range_compare() before we > > > round down the length at generic_remap_check_len() - the data compare > > > function already has the correct behaviour. > > > > <nod> > > > > --D > > > > > Thanks. > > > > > > > > > > > --D > > > > > > > > > Thanks. > > > > > > > > > > > > > > > > > --D > > > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > > --- > > > > > > > > fs/read_write.c | 10 ++++------ > > > > > > > > 1 file changed, 4 insertions(+), 6 deletions(-) > > > > > > > > > > > > > > > > diff --git a/fs/read_write.c b/fs/read_write.c > > > > > > > > index 5bbf587f5bc1..7458fccc59e1 100644 > > > > > > > > --- a/fs/read_write.c > > > > > > > > +++ b/fs/read_write.c > > > > > > > > @@ -1777,10 +1777,9 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len, > > > > > > > > * else. Assume that the offsets have already been checked for block > > > > > > > > * alignment. > > > > > > > > * > > > > > > > > - * For deduplication we always scale down to the previous block because we > > > > > > > > - * can't meaningfully compare post-EOF contents. > > > > > > > > - * > > > > > > > > - * For clone we only link a partial EOF block above the destination file's EOF. > > > > > > > > + * For clone we only link a partial EOF block above or at the destination file's > > > > > > > > + * EOF. For deduplication we accept a partial EOF block only if it ends at the > > > > > > > > + * destination file's EOF (can not link it into the middle of a file). > > > > > > > > * > > > > > > > > * Shorten the request if possible. > > > > > > > > */ > > > > > > > > @@ -1796,8 +1795,7 @@ static int generic_remap_check_len(struct inode *inode_in, > > > > > > > > if ((*len & blkmask) == 0) > > > > > > > > return 0; > > > > > > > > > > > > > > > > - if ((remap_flags & REMAP_FILE_DEDUP) || > > > > > > > > - pos_out + *len < i_size_read(inode_out)) > > > > > > > > + if (pos_out + *len < i_size_read(inode_out)) > > > > > > > > new_len &= ~blkmask; > > > > > > > > > > > > > > > > if (new_len == *len) > > > > > > > > -- > > > > > > > > 2.11.0 > > > > > > > >
On Tue, Jan 21, 2020 at 04:35:32PM -0800, Darrick J. Wong wrote: > Urk, I never reviewed this, did I... > > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Thanks, so with this we can proceed with merging, the question is how. This is in generic fs/ code but not plain VFS and affecting only btrfs and xfs. I suggest the following: I'll take the patches to a branch separate from other btrfs patches, add rev-by and stable tags and send an extra pull request to Linus. Before that the branch can spend some time in btrfs' for-next among other topic branches so there's linux-next exposure. I don't mean to sidestep VFS maintainers, but previous remap changes don't have Al Viro's signed-off either, so I hope that when at least Darrick is fine with the proposed way then let's do it. If not, please let me know.
diff --git a/fs/read_write.c b/fs/read_write.c index 5bbf587f5bc1..7458fccc59e1 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1777,10 +1777,9 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len, * else. Assume that the offsets have already been checked for block * alignment. * - * For deduplication we always scale down to the previous block because we - * can't meaningfully compare post-EOF contents. - * - * For clone we only link a partial EOF block above the destination file's EOF. + * For clone we only link a partial EOF block above or at the destination file's + * EOF. For deduplication we accept a partial EOF block only if it ends at the + * destination file's EOF (can not link it into the middle of a file). * * Shorten the request if possible. */ @@ -1796,8 +1795,7 @@ static int generic_remap_check_len(struct inode *inode_in, if ((*len & blkmask) == 0) return 0; - if ((remap_flags & REMAP_FILE_DEDUP) || - pos_out + *len < i_size_read(inode_out)) + if (pos_out + *len < i_size_read(inode_out)) new_len &= ~blkmask; if (new_len == *len)