btrfs-progs: docs: add an extra note to btrfs data checksum and directIO

Message ID 90a1ea4049bbf6d80163aa8116af722280c5d70c.1739771926.git.wqu@suse.com (mailing list archive)
State New
Series btrfs-progs: docs: add an extra note to btrfs data checksum and directIO

Commit Message

Qu Wenruo Feb. 17, 2025, 5:58 a.m. UTC
In v6.14 kernel release, btrfs will force a direct IO to fall back to
a buffered one if the inode requires a data checksum.

This will cause a small performance drop, to solve the false data
checksum mismatch problem caused by direct IOs.

Although such a change is small to most end users, for those requiring
such a zero-copy direct IO this will be a behavior change, and this
requires a proper documentation update.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Changelog:
v2:
- Grammar fixes suggested by Johannes
---
 Documentation/ch-checksumming.rst | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)
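
The race described in the commit message can be illustrated with a minimal
user-space sketch. It is only a simulation, not btrfs code: `zlib.crc32`
stands in for the real checksum (btrfs defaults to crc32c, a different
polynomial), `BLOCK_SIZE` is an assumed value, and the helper names
`checksum` and `simulate_write` are hypothetical.

```python
import zlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def checksum(block: bytes) -> int:
    # Stand-in checksum; btrfs itself defaults to crc32c, not zlib's crc32.
    return zlib.crc32(block) & 0xFFFFFFFF


def simulate_write(modify_in_flight: bool) -> bool:
    """Return True if the stored checksum matches the data that reached 'disk'."""
    user_buffer = bytearray(b"A" * BLOCK_SIZE)

    # Step 1: the checksum is calculated just before submission.
    stored_csum = checksum(bytes(user_buffer))

    # Step 2: with a direct write, user space still owns the buffer and may
    # change it while the write is in flight.
    if modify_in_flight:
        user_buffer[0] = ord("B")

    # Step 3: the device picks up whatever the buffer holds at that moment.
    on_disk = bytes(user_buffer)

    # Step 4: a later read verifies the on-disk data against the stored checksum.
    return checksum(on_disk) == stored_csum


if __name__ == "__main__":
    print("untouched buffer verifies:", simulate_write(False))
    print("buffer modified in flight verifies:", simulate_write(True))
```

In the second case the data on disk is exactly what the program last wrote
into its buffer, yet verification fails, which is why the mismatch is called
a false one.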

Comments

Johannes Thumshirn Feb. 17, 2025, 7:13 a.m. UTC | #1
Looks good to me,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Anand Jain Feb. 17, 2025, 11:23 p.m. UTC | #2
On 17/2/25 13:58, Qu Wenruo wrote:
> In v6.14 kernel release, btrfs will force a direct IO to fall back to
> a buffered one if the inode requires a data checksum.
> 
> This will cause a small performance drop, to solve the false data
> checksum mismatch problem caused by direct IOs.
> 
> Although such a change is small to most end users, for those requiring
> such a zero-copy direct IO this will be a behavior change, and this
> requires a proper documentation update.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Changelog:
> v2:
> - Grammar fixes suggested by Johannes
> ---
>   Documentation/ch-checksumming.rst | 18 ++++++++++++++++++
>   1 file changed, 18 insertions(+)
> 
> diff --git a/Documentation/ch-checksumming.rst b/Documentation/ch-checksumming.rst
> index 5e47a6bfb492..b7fde46fe902 100644
> --- a/Documentation/ch-checksumming.rst
> +++ b/Documentation/ch-checksumming.rst
> @@ -3,6 +3,24 @@ writing and verified after reading the blocks from devices. The whole metadata
>   block has an inline checksum stored in the b-tree node header. Each data block
>   has a detached checksum stored in the checksum tree.
>   
> +.. note::
> +   Since a data checksum is calculated just before submitting to the block
> +   device, btrfs has a strong requirement that the corresponding data block must
> +   not be modified until the writeback is finished.
> +
> +   This requirement is met for a buffered write as btrfs has the full control on
> +   its page caches, but a direct write (``O_DIRECT``) bypasses page caches, and
> +   btrfs can not control the direct IO buffer (as it can be in user space memory),
> +   thus it's possible that a user space program modifies its direct write buffer
> +   before the buffer is fully written back, and this can lead to a data checksum mismatch.
> +

> +   To avoid such a checksum mismatch, since v6.14 btrfs will force a direct
> +   write to fall back to a buffered one, if the inode requires a data checksum.
> +   This will bring a small performance penalty, and if the end user requires true
> +   zero-copy direct writes, they should set the ``NODATASUM`` flag for the inode
> +   and make sure the direct IO buffer is fully aligned to btrfs block size.

This section covers how the bug was fixed in v6.14, but that makes
you wonder—what about earlier versions? It’d be helpful to add a
paragraph on that.

Thx.


> +
> +
>   There are several checksum algorithms supported. The default and backward
>   compatible algorithm is *crc32c*. Since kernel 5.5 there are three more with different
>   characteristics and trade-offs regarding speed and strength. The following list
Qu Wenruo Feb. 17, 2025, 11:41 p.m. UTC | #3
On 2025/2/18 09:53, Anand Jain wrote:
> On 17/2/25 13:58, Qu Wenruo wrote:
>> In v6.14 kernel release, btrfs will force a direct IO to fall back to
>> a buffered one if the inode requires a data checksum.
>>
>> This will cause a small performance drop, to solve the false data
>> checksum mismatch problem caused by direct IOs.
>>
>> Although such a change is small to most end users, for those requiring
>> such a zero-copy direct IO this will be a behavior change, and this
>> requires a proper documentation update.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> Changelog:
>> v2:
>> - Grammar fixes suggested by Johannes
>> ---
>>   Documentation/ch-checksumming.rst | 18 ++++++++++++++++++
>>   1 file changed, 18 insertions(+)
>>
>> diff --git a/Documentation/ch-checksumming.rst b/Documentation/ch-checksumming.rst
>> index 5e47a6bfb492..b7fde46fe902 100644
>> --- a/Documentation/ch-checksumming.rst
>> +++ b/Documentation/ch-checksumming.rst
>> @@ -3,6 +3,24 @@ writing and verified after reading the blocks from 
>> devices. The whole metadata
>>   block has an inline checksum stored in the b-tree node header. Each 
>> data block
>>   has a detached checksum stored in the checksum tree.
>> +.. note::
>> +   Since a data checksum is calculated just before submitting to the 
>> block
>> +   device, btrfs has a strong requirement that the corresponding data 
>> block must
>> +   not be modified until the writeback is finished.
>> +
>> +   This requirement is met for a buffered write as btrfs has the full 
>> control on
>> +   its page caches, but a direct write (``O_DIRECT``) bypasses page 
>> caches, and
>> +   btrfs can not control the direct IO buffer (as it can be in user 
>> space memory),
>> +   thus it's possible that a user space program modifies its direct 
>> write buffer
>> +   before the buffer is fully written back, and this can lead to a 
>> data checksum mismatch.
>> +
> 
>> +   To avoid such a checksum mismatch, since v6.14 btrfs will force a 
>> direct
>> +   write to fall back to a buffered one, if the inode requires a data 
>> checksum.
>> +   This will bring a small performance penalty, and if the end user 
>> requires true
>> +   zero-copy direct writes, they should set the ``NODATASUM`` flag 
>> for the inode
>> +   and make sure the direct IO buffer is fully aligned to btrfs block 
>> size.
> 
> This section covers how the bug was fixed in v6.14, but that makes
> you wonder—what about earlier versions? It’d be helpful to add a
> paragraph on that.

I'm planning to update this part when the backport lands in the
corresponding backport branches.

But since the patch is not even upstreamed yet, I'm not mentioning it for now.

Thanks,
Qu

> 
> Thx.
> 
> 
>> +
>> +
>>   There are several checksum algorithms supported. The default and 
>> backward
>>   compatible algorithm is *crc32c*. Since kernel 5.5 there are three 
>> more with different
>>   characteristics and trade-offs regarding speed and strength. The 
>> following list
>
Patch

diff --git a/Documentation/ch-checksumming.rst b/Documentation/ch-checksumming.rst
index 5e47a6bfb492..b7fde46fe902 100644
--- a/Documentation/ch-checksumming.rst
+++ b/Documentation/ch-checksumming.rst
@@ -3,6 +3,24 @@  writing and verified after reading the blocks from devices. The whole metadata
 block has an inline checksum stored in the b-tree node header. Each data block
 has a detached checksum stored in the checksum tree.
 
+.. note::
+   Since a data checksum is calculated just before submitting to the block
+   device, btrfs has a strong requirement that the corresponding data block must
+   not be modified until the writeback is finished.
+
+   This requirement is met for a buffered write as btrfs has the full control on
+   its page caches, but a direct write (``O_DIRECT``) bypasses page caches, and
+   btrfs can not control the direct IO buffer (as it can be in user space memory),
+   thus it's possible that a user space program modifies its direct write buffer
+   before the buffer is fully written back, and this can lead to a data checksum mismatch.
+
+   To avoid such a checksum mismatch, since v6.14 btrfs will force a direct
+   write to fall back to a buffered one, if the inode requires a data checksum.
+   This will bring a small performance penalty, and if the end user requires true
+   zero-copy direct writes, they should set the ``NODATASUM`` flag for the inode
+   and make sure the direct IO buffer is fully aligned to btrfs block size.
+
+
 There are several checksum algorithms supported. The default and backward
 compatible algorithm is *crc32c*. Since kernel 5.5 there are three more with different
 characteristics and trade-offs regarding speed and strength. The following list