diff mbox series

btrfs: fix data overwriting bug during buffered write when block size < page size

Message ID a50ceebfe3155ab0f1f0018c28ef99bda264c039.1739920169.git.wqu@suse.com (mailing list archive)
State New
Headers show
Series btrfs: fix data overwriting bug during buffered write when block size < page size | expand

Commit Message

Qu Wenruo Feb. 18, 2025, 11:09 p.m. UTC
[BUG]
When running generic/417 with a btrfs whose block size < page size
(subpage cases), it always fails.

And the following minimal reproducer is more than enough to trigger it
reliably:

workload()
{
        mkfs.btrfs -s 4k -f $dev > /dev/null
        dmesg -C
        mount $dev $mnt
        $fsstree_dir/src/dio-invalidate-cache -r -b 4096 -n 3 -i 1 -f $mnt/diotest
        ret=$?
        umount $mnt
        stop_trace
        if [ $ret -ne 0 ]; then
                fail
        fi
}

for (( i = 0; i < 1024; i++)); do
        echo "=== $i/$runtime ==="
        workload
done

[CAUSE]
With extra trace printk added to the following functions:
- btrfs_buffered_write()
  * Which folio is touched
  * The file offset (start) where the buffered write is at
  * How many bytes are copied
  * The content of the write (the first 2 bytes)

- submit_one_sector()
  * Which folio is touched
  * The position inside the folio

- pagecache_isize_extended()
  * The parameters of the function itself
  * The parameters of the folio_zero_range()

Which are enough to show the problem:

  22.158114: btrfs_buffered_write: folio pos=0 start=0 copied=4096 content=0x0101
  22.158161: submit_one_sector: r/i=5/257 folio=0 pos=0 content=0x0101
  22.158609: btrfs_buffered_write: folio pos=0 start=4096 copied=4096 content=0x0101
  22.158634: btrfs_buffered_write: folio pos=0 start=8192 copied=4096 content=0x0101
  22.158650: pagecache_isize_extended: folio=0 from=4096 to=8192 bsize=4096 zero off=4096 len=8192
  22.158682: submit_one_sector: r/i=5/257 folio=0 pos=4096 content=0x0000
  22.158686: submit_one_sector: r/i=5/257 folio=0 pos=8192 content=0x0101

The tool dio-invalidate-cache will start 3 threads, each doing a buffered
write with 0x01 at 4096 * i (i is 0, 1 ,2), do a fsync, then do a direct read,
and compare the read buffer with the write buffer.

Note that all 3 btrfs_buffered_write() are writing the correct 0x01 into
the page cache.

But at submit_one_sector(), at file offset 4096, the content is zeroed
out, mostly by pagecache_isize_extended().

The race happens like this:
 Thread A is writing into range [4K, 8K).
 Thread B is writing into range [8K, 12k).

               Thread A              |         Thread B
-------------------------------------+------------------------------------
btrfs_buffered_write()               | btrfs_buffered_write()
|- old_isize = 4K;                   | |- old_isize = 4096;
|- btrfs_inode_lock()                | |
|- write into folio range [4K, 8K)   | |
|- pagecache_isize_extended()        | |
|  extend isize from 4096 to 8192    | |
|  no folio_zero_range() called      | |
|- btrfs_inode_lock()                | |
                                     | |- btrfs_inode_lock()
				     | |- write into folio range [8K, 12K)
				     | |- pagecache_isize_extended()
				     | |  calling folio_zero_range(4K, 8K)
				     | |  This is caused by the old_isize is
				     | |  grabbed too early, without any
				     | |  inode lock.
				     | |- btrfs_inode_unlock()

The @old_isize is grabbed without inode lock, causing race between two
buffered write threads and making pagecache_isize_extended() to zero
range which is still containing cached data.

And this is only affecting subpage btrfs, because for regular blocksize
== page size case, the function pagecache_isize_extended() will do
nothing if the block size >= page size.

[FIX]
Grab the old isize with inode lock hold.
This means each buffered write thread will have a stable view of the
old inode size, thus avoid the above race.

Cc: stable@vger.kernel.org
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/file.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Comments

Filipe Manana Feb. 19, 2025, 10:45 a.m. UTC | #1
On Tue, Feb 18, 2025 at 11:10 PM Qu Wenruo <wqu@suse.com> wrote:
>
> [BUG]
> When running generic/417 with a btrfs whose block size < page size

generic/417 -> generic/418

> (subpage cases), it always fails.
>
> And the following minimal reproducer is more than enough to trigger it
> reliably:
>
> workload()
> {
>         mkfs.btrfs -s 4k -f $dev > /dev/null
>         dmesg -C
>         mount $dev $mnt
>         $fsstree_dir/src/dio-invalidate-cache -r -b 4096 -n 3 -i 1 -f $mnt/diotest
>         ret=$?
>         umount $mnt
>         stop_trace
>         if [ $ret -ne 0 ]; then
>                 fail
>         fi
> }
>
> for (( i = 0; i < 1024; i++)); do
>         echo "=== $i/$runtime ==="
>         workload
> done
>
> [CAUSE]
> With extra trace printk added to the following functions:
> - btrfs_buffered_write()
>   * Which folio is touched
>   * The file offset (start) where the buffered write is at
>   * How many bytes are copied
>   * The content of the write (the first 2 bytes)
>
> - submit_one_sector()
>   * Which folio is touched
>   * The position inside the folio
>
> - pagecache_isize_extended()
>   * The parameters of the function itself
>   * The parameters of the folio_zero_range()
>
> Which are enough to show the problem:
>
>   22.158114: btrfs_buffered_write: folio pos=0 start=0 copied=4096 content=0x0101
>   22.158161: submit_one_sector: r/i=5/257 folio=0 pos=0 content=0x0101
>   22.158609: btrfs_buffered_write: folio pos=0 start=4096 copied=4096 content=0x0101
>   22.158634: btrfs_buffered_write: folio pos=0 start=8192 copied=4096 content=0x0101
>   22.158650: pagecache_isize_extended: folio=0 from=4096 to=8192 bsize=4096 zero off=4096 len=8192
>   22.158682: submit_one_sector: r/i=5/257 folio=0 pos=4096 content=0x0000
>   22.158686: submit_one_sector: r/i=5/257 folio=0 pos=8192 content=0x0101
>
> The tool dio-invalidate-cache will start 3 threads, each doing a buffered
> write with 0x01 at 4096 * i (i is 0, 1 ,2), do a fsync, then do a direct read,
> and compare the read buffer with the write buffer.
>
> Note that all 3 btrfs_buffered_write() are writing the correct 0x01 into
> the page cache.
>
> But at submit_one_sector(), at file offset 4096, the content is zeroed
> out, mostly by pagecache_isize_extended().
>
> The race happens like this:
>  Thread A is writing into range [4K, 8K).
>  Thread B is writing into range [8K, 12k).
>
>                Thread A              |         Thread B
> -------------------------------------+------------------------------------
> btrfs_buffered_write()               | btrfs_buffered_write()
> |- old_isize = 4K;                   | |- old_isize = 4096;
> |- btrfs_inode_lock()                | |
> |- write into folio range [4K, 8K)   | |
> |- pagecache_isize_extended()        | |
> |  extend isize from 4096 to 8192    | |
> |  no folio_zero_range() called      | |
> |- btrfs_inode_lock()                | |
>                                      | |- btrfs_inode_lock()
>                                      | |- write into folio range [8K, 12K)
>                                      | |- pagecache_isize_extended()
>                                      | |  calling folio_zero_range(4K, 8K)
>                                      | |  This is caused by the old_isize is
>                                      | |  grabbed too early, without any
>                                      | |  inode lock.
>                                      | |- btrfs_inode_unlock()
>
> The @old_isize is grabbed without inode lock, causing race between two
> buffered write threads and making pagecache_isize_extended() to zero
> range which is still containing cached data.
>
> And this is only affecting subpage btrfs, because for regular blocksize
> == page size case, the function pagecache_isize_extended() will do
> nothing if the block size >= page size.
>
> [FIX]
> Grab the old isize with inode lock hold.

hold -> held

Or easier to understand:

Grab the old i_size while holding the inode lock.

> This means each buffered write thread will have a stable view of the
> old inode size, thus avoid the above race.
>
> Cc: stable@vger.kernel.org
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Fixes: 5e8b9ef30392 ("btrfs: move pos increment and pagecache
extension to btrfs_buffered_write")

That is the commit that introduced the race, before it we were
grabbing i_size and doing the pagecache_isize_extended() call while
holding the inode lock.

> ---
>  fs/btrfs/file.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index fd90855fe717..896dc03689d6 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1090,7 +1090,7 @@ ssize_t btrfs_buffered_write(struct kiocb *iocb, struct iov_iter *i)
>         u64 lockend;
>         size_t num_written = 0;
>         ssize_t ret;
> -       loff_t old_isize = i_size_read(inode);
> +       loff_t old_isize;
>         unsigned int ilock_flags = 0;
>         const bool nowait = (iocb->ki_flags & IOCB_NOWAIT);
>         unsigned int bdp_flags = (nowait ? BDP_ASYNC : 0);
> @@ -1103,6 +1103,13 @@ ssize_t btrfs_buffered_write(struct kiocb *iocb, struct iov_iter *i)
>         if (ret < 0)
>                 return ret;
>
> +       /*
> +        * We can only trust the isize with inode lock hold, or it can race with

Same here, hold -> held

With those things addressed:

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.

> +        * other buffered writes and cause incorrect call of
> +        * pagecache_isize_extended() to overwrite existing data.
> +        */
> +       old_isize = i_size_read(inode);
> +
>         ret = generic_write_checks(iocb, i);
>         if (ret <= 0)
>                 goto out;
> --
> 2.48.1
>
>
diff mbox series

Patch

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index fd90855fe717..896dc03689d6 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1090,7 +1090,7 @@  ssize_t btrfs_buffered_write(struct kiocb *iocb, struct iov_iter *i)
 	u64 lockend;
 	size_t num_written = 0;
 	ssize_t ret;
-	loff_t old_isize = i_size_read(inode);
+	loff_t old_isize;
 	unsigned int ilock_flags = 0;
 	const bool nowait = (iocb->ki_flags & IOCB_NOWAIT);
 	unsigned int bdp_flags = (nowait ? BDP_ASYNC : 0);
@@ -1103,6 +1103,13 @@  ssize_t btrfs_buffered_write(struct kiocb *iocb, struct iov_iter *i)
 	if (ret < 0)
 		return ret;
 
+	/*
+	 * We can only trust the isize with inode lock hold, or it can race with
+	 * other buffered writes and cause incorrect call of
+	 * pagecache_isize_extended() to overwrite existing data.
+	 */
+	old_isize = i_size_read(inode);
+
 	ret = generic_write_checks(iocb, i);
 	if (ret <= 0)
 		goto out;