diff mbox series

[v2] btrfs: always fallback to buffered write if the inode requires checksum

Message ID e9b8716e2d613cac27e59ceb141f973540f40eef.1738639778.git.wqu@suse.com (mailing list archive)
State New
Headers show
Series [v2] btrfs: always fallback to buffered write if the inode requires checksum | expand

Commit Message

Qu Wenruo Feb. 4, 2025, 3:30 a.m. UTC
[BUG]
It is a long known bug that VM image on btrfs can lead to data csum
mismatch, if the qemu is using direct-io for the image (this is commonly
known as cache mode none).

[CAUSE]
Inside the VM, if the fs is EXT4 or XFS, or even NTFS from Windows, the
fs is allowed to dirty/modify the folio even the folio is under
writeback (as long as the address space doesn't have AS_STABLE_WRITES
flag inherited from the block device).

This is a valid optimization to improve the concurrency, and since these
filesystems have no extra checksum on data, the content change is not a
problem at all.

But the final write into the image file is handled by btrfs, which need
the content not to be modified during writeback, or the checksum will
not match the data (checksum is calculated before submitting the bio).

So EXT4/XFS/NTRFS believes they can modify the folio under writeback,
but btrfs requires no modification, this leads to the false csum
mismatch.

This is only a controlled example, there are even cases where
multi-thread programs can submit a direct IO write, then another thread
modifies the direct IO buffer for whatever reason.

For such cases, btrfs has no sane way to detect such cases and leads to
false data csum mismatch.

[FIX]
I have considered the following ideas to solve the problem:

- Make direct IO to always skip data checksum
  This not only requires a new incompatible flag, as it breaks the
  current per-inode NODATASUM flag.
  But also requires extra handling for no csum found cases.

  And this also reduces our checksum protection.

- Let hardware to handle all the checksum
  AKA, just nodatasum mount option.
  That requires trust for hardware (which is not that trustful in a lot
  of cases), and it's not generic at all.

- Always fallback to buffered write if the inode requires checksum
  This is suggested by Christoph, and is the solution utilized by this
  patch.

  The cost is obvious, the extra buffer copying into page cache, thus it
  reduce the performance.
  But at least it's still user configurable, if the end user still wants
  the zero-copy performance, just set NODATASUM flag for the inode
  (which is a common practice for VM images on btrfs).

  Since we can not trust user space programs to keep the buffer
  consistent during direct IO, we have no choice but always falling
  back to buffered IO.
  At least by this, we avoid the more deadly false data checksum
  mismatch error.

Suggested-by: hch@infradead.org <hch@infradead.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Changelog:
v2:
- Move the checksum check just after check_direct_IO()
  So that we don't need the extra ENOTBLK error code trick to fallback
  to buffered IO.

- Slightly improve the comment
  Adds that only direct write is affected, and why buffered write is
  fine.
---
 fs/btrfs/direct-io.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

Comments

Filipe Manana Feb. 4, 2025, 11:19 a.m. UTC | #1
On Tue, Feb 4, 2025 at 3:31 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [BUG]
> It is a long known bug that VM image on btrfs can lead to data csum
> mismatch, if the qemu is using direct-io for the image (this is commonly
> known as cache mode none).
>
> [CAUSE]
> Inside the VM, if the fs is EXT4 or XFS, or even NTFS from Windows, the
> fs is allowed to dirty/modify the folio even the folio is under
> writeback (as long as the address space doesn't have AS_STABLE_WRITES
> flag inherited from the block device).
>
> This is a valid optimization to improve the concurrency, and since these
> filesystems have no extra checksum on data, the content change is not a
> problem at all.
>
> But the final write into the image file is handled by btrfs, which need
> the content not to be modified during writeback, or the checksum will
> not match the data (checksum is calculated before submitting the bio).
>
> So EXT4/XFS/NTRFS believes they can modify the folio under writeback,
> but btrfs requires no modification, this leads to the false csum
> mismatch.
>
> This is only a controlled example, there are even cases where
> multi-thread programs can submit a direct IO write, then another thread
> modifies the direct IO buffer for whatever reason.
>
> For such cases, btrfs has no sane way to detect such cases and leads to
> false data csum mismatch.
>
> [FIX]
> I have considered the following ideas to solve the problem:
>
> - Make direct IO to always skip data checksum
>   This not only requires a new incompatible flag, as it breaks the
>   current per-inode NODATASUM flag.
>   But also requires extra handling for no csum found cases.
>
>   And this also reduces our checksum protection.
>
> - Let hardware to handle all the checksum
>   AKA, just nodatasum mount option.
>   That requires trust for hardware (which is not that trustful in a lot
>   of cases), and it's not generic at all.
>
> - Always fallback to buffered write if the inode requires checksum
>   This is suggested by Christoph, and is the solution utilized by this
>   patch.
>
>   The cost is obvious, the extra buffer copying into page cache, thus it
>   reduce the performance.
>   But at least it's still user configurable, if the end user still wants
>   the zero-copy performance, just set NODATASUM flag for the inode
>   (which is a common practice for VM images on btrfs).
>
>   Since we can not trust user space programs to keep the buffer
>   consistent during direct IO, we have no choice but always falling
>   back to buffered IO.
>   At least by this, we avoid the more deadly false data checksum
>   mismatch error.
>
> Suggested-by: hch@infradead.org <hch@infradead.org>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Changelog:
> v2:
> - Move the checksum check just after check_direct_IO()
>   So that we don't need the extra ENOTBLK error code trick to fallback
>   to buffered IO.
>
> - Slightly improve the comment
>   Adds that only direct write is affected, and why buffered write is
>   fine.
> ---
>  fs/btrfs/direct-io.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
> index c99ceabcd792..0de4397130be 100644
> --- a/fs/btrfs/direct-io.c
> +++ b/fs/btrfs/direct-io.c
> @@ -855,6 +855,21 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>                 btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
>                 goto buffered;
>         }
> +       /*
> +        * For direct IO write, we have no control on the folios passed in,
> +        * thus the content can change halfway after we calculated the data
> +        * checksum, and result data checksum mismatch and unable to read
> +        * the data out anymore.

I would phrase this differently to be more clear:

We can't control the folios being passed in, applications can write to
them while a
direct IO write is in progress. This means the content might change
after we calculate the data
checksum. Therefore we can end up storing a checksum that doesn't
match the persisted data.

Otherwise it looks fine to me:

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.

> +        *
> +        * To be extra safe and avoid false data checksum mismatch, if the
> +        * inode requires data checksum, just fallback to buffered IO.
> +        * For buffered IO we have full control of page cache and can ensure
> +        * no one is modifying the content during writeback.
> +        */
> +       if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
> +               btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
> +               goto buffered;
> +       }
>
>         /*
>          * The iov_iter can be mapped to the same file range we are writing to.
> --
> 2.48.1
>
>
Filipe Manana Feb. 5, 2025, 7:12 p.m. UTC | #2
On Tue, Feb 4, 2025 at 11:19 AM Filipe Manana <fdmanana@kernel.org> wrote:
>
> On Tue, Feb 4, 2025 at 3:31 AM Qu Wenruo <wqu@suse.com> wrote:
> >
> > [BUG]
> > It is a long known bug that VM image on btrfs can lead to data csum
> > mismatch, if the qemu is using direct-io for the image (this is commonly
> > known as cache mode none).
> >
> > [CAUSE]
> > Inside the VM, if the fs is EXT4 or XFS, or even NTFS from Windows, the
> > fs is allowed to dirty/modify the folio even the folio is under
> > writeback (as long as the address space doesn't have AS_STABLE_WRITES
> > flag inherited from the block device).
> >
> > This is a valid optimization to improve the concurrency, and since these
> > filesystems have no extra checksum on data, the content change is not a
> > problem at all.
> >
> > But the final write into the image file is handled by btrfs, which need
> > the content not to be modified during writeback, or the checksum will
> > not match the data (checksum is calculated before submitting the bio).
> >
> > So EXT4/XFS/NTRFS believes they can modify the folio under writeback,
> > but btrfs requires no modification, this leads to the false csum
> > mismatch.
> >
> > This is only a controlled example, there are even cases where
> > multi-thread programs can submit a direct IO write, then another thread
> > modifies the direct IO buffer for whatever reason.
> >
> > For such cases, btrfs has no sane way to detect such cases and leads to
> > false data csum mismatch.
> >
> > [FIX]
> > I have considered the following ideas to solve the problem:
> >
> > - Make direct IO to always skip data checksum
> >   This not only requires a new incompatible flag, as it breaks the
> >   current per-inode NODATASUM flag.
> >   But also requires extra handling for no csum found cases.
> >
> >   And this also reduces our checksum protection.
> >
> > - Let hardware to handle all the checksum
> >   AKA, just nodatasum mount option.
> >   That requires trust for hardware (which is not that trustful in a lot
> >   of cases), and it's not generic at all.
> >
> > - Always fallback to buffered write if the inode requires checksum
> >   This is suggested by Christoph, and is the solution utilized by this
> >   patch.
> >
> >   The cost is obvious, the extra buffer copying into page cache, thus it
> >   reduce the performance.
> >   But at least it's still user configurable, if the end user still wants
> >   the zero-copy performance, just set NODATASUM flag for the inode
> >   (which is a common practice for VM images on btrfs).
> >
> >   Since we can not trust user space programs to keep the buffer
> >   consistent during direct IO, we have no choice but always falling
> >   back to buffered IO.
> >   At least by this, we avoid the more deadly false data checksum
> >   mismatch error.
> >
> > Suggested-by: hch@infradead.org <hch@infradead.org>
> > Signed-off-by: Qu Wenruo <wqu@suse.com>
> > ---
> > Changelog:
> > v2:
> > - Move the checksum check just after check_direct_IO()
> >   So that we don't need the extra ENOTBLK error code trick to fallback
> >   to buffered IO.
> >
> > - Slightly improve the comment
> >   Adds that only direct write is affected, and why buffered write is
> >   fine.
> > ---
> >  fs/btrfs/direct-io.c | 15 +++++++++++++++
> >  1 file changed, 15 insertions(+)
> >
> > diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
> > index c99ceabcd792..0de4397130be 100644
> > --- a/fs/btrfs/direct-io.c
> > +++ b/fs/btrfs/direct-io.c
> > @@ -855,6 +855,21 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
> >                 btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
> >                 goto buffered;
> >         }
> > +       /*
> > +        * For direct IO write, we have no control on the folios passed in,
> > +        * thus the content can change halfway after we calculated the data
> > +        * checksum, and result data checksum mismatch and unable to read
> > +        * the data out anymore.
>
> I would phrase this differently to be more clear:
>
> We can't control the folios being passed in, applications can write to
> them while a
> direct IO write is in progress. This means the content might change
> after we calculate the data
> checksum. Therefore we can end up storing a checksum that doesn't
> match the persisted data.
>
> Otherwise it looks fine to me:
>
> Reviewed-by: Filipe Manana <fdmanana@suse.com>

Btw, did you actually run fstests with this?

At least one test is failing after this change:

btrfs/226 2s ... - output mismatch (see
/home/fdmanana/git/hub/xfstests/results//btrfs/226.out.bad)
    --- tests/btrfs/226.out 2024-05-20 11:27:55.933394605 +0100
    +++ /home/fdmanana/git/hub/xfstests/results//btrfs/226.out.bad
2025-02-05 19:09:33.990188790 +0000
    @@ -39,14 +39,11 @@
     Testing write against prealloc extent at eof
     wrote 65536/65536 bytes at offset 0
     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    -wrote 65536/65536 bytes at offset 65536
    -XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    +pwrite: Resource temporarily unavailable
     File after write:
    ...
    (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/226.out
/home/fdmanana/git/hub/xfstests/results//btrfs/226.out.bad'  to see
the entire diff)



>
> Thanks.
>
> > +        *
> > +        * To be extra safe and avoid false data checksum mismatch, if the
> > +        * inode requires data checksum, just fallback to buffered IO.
> > +        * For buffered IO we have full control of page cache and can ensure
> > +        * no one is modifying the content during writeback.
> > +        */
> > +       if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
> > +               btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
> > +               goto buffered;
> > +       }
> >
> >         /*
> >          * The iov_iter can be mapped to the same file range we are writing to.
> > --
> > 2.48.1
> >
> >
Qu Wenruo Feb. 5, 2025, 9 p.m. UTC | #3
在 2025/2/6 05:42, Filipe Manana 写道:
> On Tue, Feb 4, 2025 at 11:19 AM Filipe Manana <fdmanana@kernel.org> wrote:
>>
>> On Tue, Feb 4, 2025 at 3:31 AM Qu Wenruo <wqu@suse.com> wrote:
>>>
>>> [BUG]
>>> It is a long known bug that VM image on btrfs can lead to data csum
>>> mismatch, if the qemu is using direct-io for the image (this is commonly
>>> known as cache mode none).
>>>
>>> [CAUSE]
>>> Inside the VM, if the fs is EXT4 or XFS, or even NTFS from Windows, the
>>> fs is allowed to dirty/modify the folio even the folio is under
>>> writeback (as long as the address space doesn't have AS_STABLE_WRITES
>>> flag inherited from the block device).
>>>
>>> This is a valid optimization to improve the concurrency, and since these
>>> filesystems have no extra checksum on data, the content change is not a
>>> problem at all.
>>>
>>> But the final write into the image file is handled by btrfs, which need
>>> the content not to be modified during writeback, or the checksum will
>>> not match the data (checksum is calculated before submitting the bio).
>>>
>>> So EXT4/XFS/NTRFS believes they can modify the folio under writeback,
>>> but btrfs requires no modification, this leads to the false csum
>>> mismatch.
>>>
>>> This is only a controlled example, there are even cases where
>>> multi-thread programs can submit a direct IO write, then another thread
>>> modifies the direct IO buffer for whatever reason.
>>>
>>> For such cases, btrfs has no sane way to detect such cases and leads to
>>> false data csum mismatch.
>>>
>>> [FIX]
>>> I have considered the following ideas to solve the problem:
>>>
>>> - Make direct IO to always skip data checksum
>>>    This not only requires a new incompatible flag, as it breaks the
>>>    current per-inode NODATASUM flag.
>>>    But also requires extra handling for no csum found cases.
>>>
>>>    And this also reduces our checksum protection.
>>>
>>> - Let hardware to handle all the checksum
>>>    AKA, just nodatasum mount option.
>>>    That requires trust for hardware (which is not that trustful in a lot
>>>    of cases), and it's not generic at all.
>>>
>>> - Always fallback to buffered write if the inode requires checksum
>>>    This is suggested by Christoph, and is the solution utilized by this
>>>    patch.
>>>
>>>    The cost is obvious, the extra buffer copying into page cache, thus it
>>>    reduce the performance.
>>>    But at least it's still user configurable, if the end user still wants
>>>    the zero-copy performance, just set NODATASUM flag for the inode
>>>    (which is a common practice for VM images on btrfs).
>>>
>>>    Since we can not trust user space programs to keep the buffer
>>>    consistent during direct IO, we have no choice but always falling
>>>    back to buffered IO.
>>>    At least by this, we avoid the more deadly false data checksum
>>>    mismatch error.
>>>
>>> Suggested-by: hch@infradead.org <hch@infradead.org>
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>> ---
>>> Changelog:
>>> v2:
>>> - Move the checksum check just after check_direct_IO()
>>>    So that we don't need the extra ENOTBLK error code trick to fallback
>>>    to buffered IO.
>>>
>>> - Slightly improve the comment
>>>    Adds that only direct write is affected, and why buffered write is
>>>    fine.
>>> ---
>>>   fs/btrfs/direct-io.c | 15 +++++++++++++++
>>>   1 file changed, 15 insertions(+)
>>>
>>> diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
>>> index c99ceabcd792..0de4397130be 100644
>>> --- a/fs/btrfs/direct-io.c
>>> +++ b/fs/btrfs/direct-io.c
>>> @@ -855,6 +855,21 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>>>                  btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
>>>                  goto buffered;
>>>          }
>>> +       /*
>>> +        * For direct IO write, we have no control on the folios passed in,
>>> +        * thus the content can change halfway after we calculated the data
>>> +        * checksum, and result data checksum mismatch and unable to read
>>> +        * the data out anymore.
>>
>> I would phrase this differently to be more clear:
>>
>> We can't control the folios being passed in, applications can write to
>> them while a
>> direct IO write is in progress. This means the content might change
>> after we calculate the data
>> checksum. Therefore we can end up storing a checksum that doesn't
>> match the persisted data.
>>
>> Otherwise it looks fine to me:
>>
>> Reviewed-by: Filipe Manana <fdmanana@suse.com>
> 
> Btw, did you actually run fstests with this?
> 
> At least one test is failing after this change:
> 
> btrfs/226 2s ... - output mismatch (see
> /home/fdmanana/git/hub/xfstests/results//btrfs/226.out.bad)
>      --- tests/btrfs/226.out 2024-05-20 11:27:55.933394605 +0100
>      +++ /home/fdmanana/git/hub/xfstests/results//btrfs/226.out.bad
> 2025-02-05 19:09:33.990188790 +0000
>      @@ -39,14 +39,11 @@
>       Testing write against prealloc extent at eof
>       wrote 65536/65536 bytes at offset 0
>       XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>      -wrote 65536/65536 bytes at offset 65536
>      -XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>      +pwrite: Resource temporarily unavailable

That's caused by NOWAIT flag, we rejects NOWAIT direct-io if it falls 
back to buffered one.

This should be fixed through the test case.

Or do you mean that NOWAIT direct-IO is pretty common and this change is 
going to break a lot of user space tools?

Thanks,
Qu

>       File after write:
>      ...
>      (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/226.out
> /home/fdmanana/git/hub/xfstests/results//btrfs/226.out.bad'  to see
> the entire diff)
> 
> 
> 
>>
>> Thanks.
>>
>>> +        *
>>> +        * To be extra safe and avoid false data checksum mismatch, if the
>>> +        * inode requires data checksum, just fallback to buffered IO.
>>> +        * For buffered IO we have full control of page cache and can ensure
>>> +        * no one is modifying the content during writeback.
>>> +        */
>>> +       if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
>>> +               btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
>>> +               goto buffered;
>>> +       }
>>>
>>>          /*
>>>           * The iov_iter can be mapped to the same file range we are writing to.
>>> --
>>> 2.48.1
>>>
>>>
Christoph Hellwig Feb. 6, 2025, 2:13 p.m. UTC | #4
On Tue, Feb 04, 2025 at 02:00:23PM +1030, Qu Wenruo wrote:
> [BUG]
> It is a long known bug that VM image on btrfs can lead to data csum
> mismatch, if the qemu is using direct-io for the image (this is commonly
> known as cache mode none).
> 
> [CAUSE]
> Inside the VM, if the fs is EXT4 or XFS, or even NTFS from Windows, the
> fs is allowed to dirty/modify the folio even the folio is under
> writeback (as long as the address space doesn't have AS_STABLE_WRITES
> flag inherited from the block device).

Btw, can you add an xfstests that reproduces this by modifying pages
under direct I/O?  That would be really helpful to verify the code when
we want to turn back on real direct I/O eventually when the VM is fixed
to prevent the modifications.
Qu Wenruo Feb. 6, 2025, 8:32 p.m. UTC | #5
在 2025/2/7 00:43, hch@infradead.org 写道:
> On Tue, Feb 04, 2025 at 02:00:23PM +1030, Qu Wenruo wrote:
>> [BUG]
>> It is a long known bug that VM image on btrfs can lead to data csum
>> mismatch, if the qemu is using direct-io for the image (this is commonly
>> known as cache mode none).
>>
>> [CAUSE]
>> Inside the VM, if the fs is EXT4 or XFS, or even NTFS from Windows, the
>> fs is allowed to dirty/modify the folio even the folio is under
>> writeback (as long as the address space doesn't have AS_STABLE_WRITES
>> flag inherited from the block device).
>
> Btw, can you add an xfstests that reproduces this by modifying pages
> under direct I/O?  That would be really helpful to verify the code when
> we want to turn back on real direct I/O eventually when the VM is fixed
> to prevent the modifications.

Sure, although I'm afraid it will need to be a C program instead.

Thanks,
Qu
David Sterba Feb. 13, 2025, 8:23 p.m. UTC | #6
On Tue, Feb 04, 2025 at 02:00:23PM +1030, Qu Wenruo wrote:
> [BUG]
> It is a long known bug that VM image on btrfs can lead to data csum
> mismatch, if the qemu is using direct-io for the image (this is commonly
> known as cache mode none).
> 
> [CAUSE]
> Inside the VM, if the fs is EXT4 or XFS, or even NTFS from Windows, the
> fs is allowed to dirty/modify the folio even the folio is under
> writeback (as long as the address space doesn't have AS_STABLE_WRITES
> flag inherited from the block device).
> 
> This is a valid optimization to improve the concurrency, and since these
> filesystems have no extra checksum on data, the content change is not a
> problem at all.
> 
> But the final write into the image file is handled by btrfs, which need
> the content not to be modified during writeback, or the checksum will
> not match the data (checksum is calculated before submitting the bio).
> 
> So EXT4/XFS/NTRFS believes they can modify the folio under writeback,
> but btrfs requires no modification, this leads to the false csum
> mismatch.
> 
> This is only a controlled example, there are even cases where
> multi-thread programs can submit a direct IO write, then another thread
> modifies the direct IO buffer for whatever reason.
> 
> For such cases, btrfs has no sane way to detect such cases and leads to
> false data csum mismatch.
> 
> [FIX]
> I have considered the following ideas to solve the problem:
> 
> - Make direct IO to always skip data checksum
>   This not only requires a new incompatible flag, as it breaks the
>   current per-inode NODATASUM flag.
>   But also requires extra handling for no csum found cases.
> 
>   And this also reduces our checksum protection.
> 
> - Let hardware to handle all the checksum
>   AKA, just nodatasum mount option.
>   That requires trust for hardware (which is not that trustful in a lot
>   of cases), and it's not generic at all.
> 
> - Always fallback to buffered write if the inode requires checksum
>   This is suggested by Christoph, and is the solution utilized by this
>   patch.
> 
>   The cost is obvious, the extra buffer copying into page cache, thus it
>   reduce the performance.
>   But at least it's still user configurable, if the end user still wants
>   the zero-copy performance, just set NODATASUM flag for the inode
>   (which is a common practice for VM images on btrfs).
> 
>   Since we can not trust user space programs to keep the buffer
>   consistent during direct IO, we have no choice but always falling
>   back to buffered IO.
>   At least by this, we avoid the more deadly false data checksum
>   mismatch error.
> 
> Suggested-by: hch@infradead.org <hch@infradead.org>
> Signed-off-by: Qu Wenruo <wqu@suse.com>

I had this patch in the -rc queue but I think this is a significant
change in the semantics of DIO so the target is 6.14. A backport to
older stable tree is possible but we may need a bit longer period before
this happens. DIO is used for speed in the VMs so falling back to the
buffered write will likely be noticed.
Qu Wenruo Feb. 13, 2025, 8:32 p.m. UTC | #7
在 2025/2/14 06:53, David Sterba 写道:
> On Tue, Feb 04, 2025 at 02:00:23PM +1030, Qu Wenruo wrote:
>> [BUG]
>> It is a long known bug that VM image on btrfs can lead to data csum
>> mismatch, if the qemu is using direct-io for the image (this is commonly
>> known as cache mode none).
>>
>> [CAUSE]
>> Inside the VM, if the fs is EXT4 or XFS, or even NTFS from Windows, the
>> fs is allowed to dirty/modify the folio even the folio is under
>> writeback (as long as the address space doesn't have AS_STABLE_WRITES
>> flag inherited from the block device).
>>
>> This is a valid optimization to improve the concurrency, and since these
>> filesystems have no extra checksum on data, the content change is not a
>> problem at all.
>>
>> But the final write into the image file is handled by btrfs, which need
>> the content not to be modified during writeback, or the checksum will
>> not match the data (checksum is calculated before submitting the bio).
>>
>> So EXT4/XFS/NTRFS believes they can modify the folio under writeback,
>> but btrfs requires no modification, this leads to the false csum
>> mismatch.
>>
>> This is only a controlled example, there are even cases where
>> multi-thread programs can submit a direct IO write, then another thread
>> modifies the direct IO buffer for whatever reason.
>>
>> For such cases, btrfs has no sane way to detect such cases and leads to
>> false data csum mismatch.
>>
>> [FIX]
>> I have considered the following ideas to solve the problem:
>>
>> - Make direct IO to always skip data checksum
>>    This not only requires a new incompatible flag, as it breaks the
>>    current per-inode NODATASUM flag.
>>    But also requires extra handling for no csum found cases.
>>
>>    And this also reduces our checksum protection.
>>
>> - Let hardware to handle all the checksum
>>    AKA, just nodatasum mount option.
>>    That requires trust for hardware (which is not that trustful in a lot
>>    of cases), and it's not generic at all.
>>
>> - Always fallback to buffered write if the inode requires checksum
>>    This is suggested by Christoph, and is the solution utilized by this
>>    patch.
>>
>>    The cost is obvious, the extra buffer copying into page cache, thus it
>>    reduce the performance.
>>    But at least it's still user configurable, if the end user still wants
>>    the zero-copy performance, just set NODATASUM flag for the inode
>>    (which is a common practice for VM images on btrfs).
>>
>>    Since we can not trust user space programs to keep the buffer
>>    consistent during direct IO, we have no choice but always falling
>>    back to buffered IO.
>>    At least by this, we avoid the more deadly false data checksum
>>    mismatch error.
>>
>> Suggested-by: hch@infradead.org <hch@infradead.org>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>
> I had this patch in the -rc queue but I think this is a significant
> change in the semantics of DIO so the target is 6.14. A backport to
> older stable tree is possible but we may need a bit longer period before
> this happens. DIO is used for speed in the VMs so falling back to the
> buffered write will likely be noticed.
>

I can prepare a doc update in btrfs-progs, and I totally understand your
concern about the fallback.

However I'd argue that, for VM useage, the root reason for using direct
IO is not really for performance, but to avoid double page cache as the
VM also manage its own page cache, there is no need to double page caching.

Furthermore there are more common cases where btrfs falls back to
buffered IO. The most common one is when the DIO is only 512 aligned but
not btrfs block aligned (normally 4K).
Considering not every DIO users do the proper fs block size detection, I
believe for a lot of cases, DIO of VMs are already falling back to buffered.

Thus I'm still optimistic about the behavior change, especially
considering the extra overhead is just a memcpy, it should not be that
obvious to end users anyway.

Thanks,
Qu
diff mbox series

Patch

diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index c99ceabcd792..0de4397130be 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -855,6 +855,21 @@  ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 		btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
 		goto buffered;
 	}
+	/*
+	 * For direct IO write, we have no control on the folios passed in,
+	 * thus the content can change halfway after we calculated the data
+	 * checksum, and result data checksum mismatch and unable to read
+	 * the data out anymore.
+	 *
+	 * To be extra safe and avoid false data checksum mismatch, if the
+	 * inode requires data checksum, just fallback to buffered IO.
+	 * For buffered IO we have full control of page cache and can ensure
+	 * no one is modifying the content during writeback.
+	 */
+	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
+		btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
+		goto buffered;
+	}
 
 	/*
 	 * The iov_iter can be mapped to the same file range we are writing to.