Message ID | a50ceebfe3155ab0f1f0018c28ef99bda264c039.1739920169.git.wqu@suse.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | btrfs: fix data overwriting bug during buffered write when block size < page size | expand |
On Tue, Feb 18, 2025 at 11:10 PM Qu Wenruo <wqu@suse.com> wrote: > > [BUG] > When running generic/417 with a btrfs whose block size < page size generic/417 -> generic/418 > (subpage cases), it always fails. > > And the following minimal reproducer is more than enough to trigger it > reliably: > > workload() > { > mkfs.btrfs -s 4k -f $dev > /dev/null > dmesg -C > mount $dev $mnt > $fsstree_dir/src/dio-invalidate-cache -r -b 4096 -n 3 -i 1 -f $mnt/diotest > ret=$? > umount $mnt > stop_trace > if [ $ret -ne 0 ]; then > fail > fi > } > > for (( i = 0; i < 1024; i++)); do > echo "=== $i/$runtime ===" > workload > done > > [CAUSE] > With extra trace printk added to the following functions: > - btrfs_buffered_write() > * Which folio is touched > * The file offset (start) where the buffered write is at > * How many bytes are copied > * The content of the write (the first 2 bytes) > > - submit_one_sector() > * Which folio is touched > * The position inside the folio > > - pagecache_isize_extended() > * The parameters of the function itself > * The parameters of the folio_zero_range() > > Which are enough to show the problem: > > 22.158114: btrfs_buffered_write: folio pos=0 start=0 copied=4096 content=0x0101 > 22.158161: submit_one_sector: r/i=5/257 folio=0 pos=0 content=0x0101 > 22.158609: btrfs_buffered_write: folio pos=0 start=4096 copied=4096 content=0x0101 > 22.158634: btrfs_buffered_write: folio pos=0 start=8192 copied=4096 content=0x0101 > 22.158650: pagecache_isize_extended: folio=0 from=4096 to=8192 bsize=4096 zero off=4096 len=8192 > 22.158682: submit_one_sector: r/i=5/257 folio=0 pos=4096 content=0x0000 > 22.158686: submit_one_sector: r/i=5/257 folio=0 pos=8192 content=0x0101 > > The tool dio-invalidate-cache will start 3 threads, each doing a buffered > write with 0x01 at 4096 * i (i is 0, 1 ,2), do a fsync, then do a direct read, > and compare the read buffer with the write buffer. > > Note that all 3 btrfs_buffered_write() are writing the correct 0x01 into > the page cache. > > But at submit_one_sector(), at file offset 4096, the content is zeroed > out, mostly by pagecache_isize_extended(). > > The race happens like this: > Thread A is writing into range [4K, 8K). > Thread B is writing into range [8K, 12k). > > Thread A | Thread B > -------------------------------------+------------------------------------ > btrfs_buffered_write() | btrfs_buffered_write() > |- old_isize = 4K; | |- old_isize = 4096; > |- btrfs_inode_lock() | | > |- write into folio range [4K, 8K) | | > |- pagecache_isize_extended() | | > | extend isize from 4096 to 8192 | | > | no folio_zero_range() called | | > |- btrfs_inode_lock() | | > | |- btrfs_inode_lock() > | |- write into folio range [8K, 12K) > | |- pagecache_isize_extended() > | | calling folio_zero_range(4K, 8K) > | | This is caused by the old_isize is > | | grabbed too early, without any > | | inode lock. > | |- btrfs_inode_unlock() > > The @old_isize is grabbed without inode lock, causing race between two > buffered write threads and making pagecache_isize_extended() to zero > range which is still containing cached data. > > And this is only affecting subpage btrfs, because for regular blocksize > == page size case, the function pagecache_isize_extended() will do > nothing if the block size >= page size. > > [FIX] > Grab the old isize with inode lock hold. hold -> held Or easier to understand: Grab the old i_size while holding the inode lock. > This means each buffered write thread will have a stable view of the > old inode size, thus avoid the above race. > > Cc: stable@vger.kernel.org > Signed-off-by: Qu Wenruo <wqu@suse.com> Fixes: 5e8b9ef30392 ("btrfs: move pos increment and pagecache extension to btrfs_buffered_write") That is the commit that introduced the race, before it we were grabbing i_size and doing the pagecache_isize_extended() call while holding the inode lock. > --- > fs/btrfs/file.c | 9 ++++++++- > 1 file changed, 8 insertions(+), 1 deletion(-) > > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c > index fd90855fe717..896dc03689d6 100644 > --- a/fs/btrfs/file.c > +++ b/fs/btrfs/file.c > @@ -1090,7 +1090,7 @@ ssize_t btrfs_buffered_write(struct kiocb *iocb, struct iov_iter *i) > u64 lockend; > size_t num_written = 0; > ssize_t ret; > - loff_t old_isize = i_size_read(inode); > + loff_t old_isize; > unsigned int ilock_flags = 0; > const bool nowait = (iocb->ki_flags & IOCB_NOWAIT); > unsigned int bdp_flags = (nowait ? BDP_ASYNC : 0); > @@ -1103,6 +1103,13 @@ ssize_t btrfs_buffered_write(struct kiocb *iocb, struct iov_iter *i) > if (ret < 0) > return ret; > > + /* > + * We can only trust the isize with inode lock hold, or it can race with Same here, hold -> held With those things addressed: Reviewed-by: Filipe Manana <fdmanana@suse.com> Thanks. > + * other buffered writes and cause incorrect call of > + * pagecache_isize_extended() to overwrite existing data. > + */ > + old_isize = i_size_read(inode); > + > ret = generic_write_checks(iocb, i); > if (ret <= 0) > goto out; > -- > 2.48.1 > >
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index fd90855fe717..896dc03689d6 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1090,7 +1090,7 @@ ssize_t btrfs_buffered_write(struct kiocb *iocb, struct iov_iter *i) u64 lockend; size_t num_written = 0; ssize_t ret; - loff_t old_isize = i_size_read(inode); + loff_t old_isize; unsigned int ilock_flags = 0; const bool nowait = (iocb->ki_flags & IOCB_NOWAIT); unsigned int bdp_flags = (nowait ? BDP_ASYNC : 0); @@ -1103,6 +1103,13 @@ ssize_t btrfs_buffered_write(struct kiocb *iocb, struct iov_iter *i) if (ret < 0) return ret; + /* + * We can only trust the isize with inode lock hold, or it can race with + * other buffered writes and cause incorrect call of + * pagecache_isize_extended() to overwrite existing data. + */ + old_isize = i_size_read(inode); + ret = generic_write_checks(iocb, i); if (ret <= 0) goto out;
[BUG] When running generic/417 with a btrfs whose block size < page size (subpage cases), it always fails. And the following minimal reproducer is more than enough to trigger it reliably: workload() { mkfs.btrfs -s 4k -f $dev > /dev/null dmesg -C mount $dev $mnt $fsstree_dir/src/dio-invalidate-cache -r -b 4096 -n 3 -i 1 -f $mnt/diotest ret=$? umount $mnt stop_trace if [ $ret -ne 0 ]; then fail fi } for (( i = 0; i < 1024; i++)); do echo "=== $i/$runtime ===" workload done [CAUSE] With extra trace printk added to the following functions: - btrfs_buffered_write() * Which folio is touched * The file offset (start) where the buffered write is at * How many bytes are copied * The content of the write (the first 2 bytes) - submit_one_sector() * Which folio is touched * The position inside the folio - pagecache_isize_extended() * The parameters of the function itself * The parameters of the folio_zero_range() Which are enough to show the problem: 22.158114: btrfs_buffered_write: folio pos=0 start=0 copied=4096 content=0x0101 22.158161: submit_one_sector: r/i=5/257 folio=0 pos=0 content=0x0101 22.158609: btrfs_buffered_write: folio pos=0 start=4096 copied=4096 content=0x0101 22.158634: btrfs_buffered_write: folio pos=0 start=8192 copied=4096 content=0x0101 22.158650: pagecache_isize_extended: folio=0 from=4096 to=8192 bsize=4096 zero off=4096 len=8192 22.158682: submit_one_sector: r/i=5/257 folio=0 pos=4096 content=0x0000 22.158686: submit_one_sector: r/i=5/257 folio=0 pos=8192 content=0x0101 The tool dio-invalidate-cache will start 3 threads, each doing a buffered write with 0x01 at 4096 * i (i is 0, 1 ,2), do a fsync, then do a direct read, and compare the read buffer with the write buffer. Note that all 3 btrfs_buffered_write() are writing the correct 0x01 into the page cache. But at submit_one_sector(), at file offset 4096, the content is zeroed out, mostly by pagecache_isize_extended(). The race happens like this: Thread A is writing into range [4K, 8K). Thread B is writing into range [8K, 12k). Thread A | Thread B -------------------------------------+------------------------------------ btrfs_buffered_write() | btrfs_buffered_write() |- old_isize = 4K; | |- old_isize = 4096; |- btrfs_inode_lock() | | |- write into folio range [4K, 8K) | | |- pagecache_isize_extended() | | | extend isize from 4096 to 8192 | | | no folio_zero_range() called | | |- btrfs_inode_lock() | | | |- btrfs_inode_lock() | |- write into folio range [8K, 12K) | |- pagecache_isize_extended() | | calling folio_zero_range(4K, 8K) | | This is caused by the old_isize is | | grabbed too early, without any | | inode lock. | |- btrfs_inode_unlock() The @old_isize is grabbed without inode lock, causing race between two buffered write threads and making pagecache_isize_extended() to zero range which is still containing cached data. And this is only affecting subpage btrfs, because for regular blocksize == page size case, the function pagecache_isize_extended() will do nothing if the block size >= page size. [FIX] Grab the old isize with inode lock hold. This means each buffered write thread will have a stable view of the old inode size, thus avoid the above race. Cc: stable@vger.kernel.org Signed-off-by: Qu Wenruo <wqu@suse.com> --- fs/btrfs/file.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-)