Message ID | 90a1ea4049bbf6d80163aa8116af722280c5d70c.1739771926.git.wqu@suse.com (mailing list archive) |
---|---|
State | New |
Series | btrfs-progs: docs: add an extra note to btrfs data checksum and directIO |
Looks good to me,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
On 17/2/25 13:58, Qu Wenruo wrote:
> [...]
> + To avoid such a checksum mismatch, since v6.14 btrfs will force a direct
> + write to fall back to a buffered one, if the inode requires a data checksum.
> + This will bring a small performance penalty, and if the end user requires true
> + zero-copy direct writes, they should set the ``NODATASUM`` flag for the inode
> + and make sure the direct IO buffer is fully aligned to btrfs block size.

This section covers how the bug was fixed in v6.14, but what about earlier
versions? It would be helpful to add a paragraph on that.

Thx.

> [...]
On 2025/2/18 09:53, Anand Jain wrote:
> On 17/2/25 13:58, Qu Wenruo wrote:
>> [...]
>> + To avoid such a checksum mismatch, since v6.14 btrfs will force a direct
>> + write to fall back to a buffered one, if the inode requires a data checksum.
>> + This will bring a small performance penalty, and if the end user requires true
>> + zero-copy direct writes, they should set the ``NODATASUM`` flag for the inode
>> + and make sure the direct IO buffer is fully aligned to btrfs block size.
>
> This section covers how the bug was fixed in v6.14, but what about earlier
> versions? It would be helpful to add a paragraph on that.

I'm planning to update this part once the backport lands in the
corresponding backport branches. But since the patch is not even
upstream yet, I am not mentioning it for now.

Thanks,
Qu

> Thx.
>
>> [...]
diff --git a/Documentation/ch-checksumming.rst b/Documentation/ch-checksumming.rst
index 5e47a6bfb492..b7fde46fe902 100644
--- a/Documentation/ch-checksumming.rst
+++ b/Documentation/ch-checksumming.rst
@@ -3,6 +3,24 @@ writing and verified after reading the blocks from devices. The whole metadata
 block has an inline checksum stored in the b-tree node header. Each data block
 has a detached checksum stored in the checksum tree.
 
+.. note::
+ Since a data checksum is calculated just before submitting to the block
+ device, btrfs has a strong requirement that the corresponding data block must
+ not be modified until the writeback is finished.
+
+ This requirement is met for a buffered write as btrfs has the full control on
+ its page caches, but a direct write (``O_DIRECT``) bypasses page caches, and
+ btrfs can not control the direct IO buffer (as it can be in user space memory),
+ thus it's possible that a user space program modifies its direct write buffer
+ before the buffer is fully written back, and this can lead to a data checksum mismatch.
+
+ To avoid such a checksum mismatch, since v6.14 btrfs will force a direct
+ write to fall back to a buffered one, if the inode requires a data checksum.
+ This will bring a small performance penalty, and if the end user requires true
+ zero-copy direct writes, they should set the ``NODATASUM`` flag for the inode
+ and make sure the direct IO buffer is fully aligned to btrfs block size.
+
+
 There are several checksum algorithms supported. The default and backward
 compatible algorithm is *crc32c*. Since kernel 5.5 there are three more with different
 characteristics and trade-offs regarding speed and strength. The following list
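The race described in the note can be illustrated with a small user space simulation. This is only a sketch, not btrfs code: `submit_write` is a hypothetical helper, the "device" is just a copy of the buffer, and `zlib.crc32` stands in for crc32c, which the Python standard library does not provide.

```python
import zlib
from typing import Tuple


def submit_write(buf: bytearray, modify_in_flight: bool) -> Tuple[int, int]:
    """Return (stored checksum, checksum of the bytes that reached the 'device').

    Hypothetical model of a write path, for illustration only.
    """
    # Step 1: the checksum is calculated just before submission.
    stored_csum = zlib.crc32(bytes(buf))

    # Step 2: for a direct write the filesystem does not own the buffer,
    # so user space can still change it while the IO is in flight.
    if modify_in_flight:
        buf[0] ^= 0xFF

    # Step 3: the device receives whatever the buffer holds at this point.
    on_disk = bytes(buf)
    return stored_csum, zlib.crc32(on_disk)


stable = submit_write(bytearray(b"A" * 4096), modify_in_flight=False)
racy = submit_write(bytearray(b"A" * 4096), modify_in_flight=True)
print("stable buffer matches:", stable[0] == stable[1])
print("modified buffer matches:", racy[0] == racy[1])
```

A buffered write avoids the second case because the data is first copied into page cache that the filesystem controls, so the bytes that are checksummed are the bytes that are written.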
In the v6.14 kernel release, btrfs will force a direct IO to fall back to
a buffered one if the inode requires a data checksum.

This will cause a small performance drop, but it solves the false data
checksum mismatch problem caused by direct IOs.

Although such a change is small to most end users, for those requiring
zero-copy direct IO this is a behavior change, and it requires a proper
documentation update.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Changelog:
v2:
- Grammar fixes suggested by Johannes
---
 Documentation/ch-checksumming.rst | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)
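For readers who do need zero-copy direct writes, the alignment requirement can be sketched as follows. This is a minimal example under stated assumptions: `BLOCK_SIZE` of 4096 is assumed rather than queried from the filesystem, and `round_up`, `is_aligned` and `make_direct_io_buffer` are hypothetical helper names, not part of any btrfs tool.

```python
import mmap

BLOCK_SIZE = 4096  # assumption: btrfs block size; verify it for your filesystem


def round_up(length: int, block_size: int = BLOCK_SIZE) -> int:
    """Round length up to the next multiple of block_size."""
    return (length + block_size - 1) // block_size * block_size


def is_aligned(offset: int, length: int, block_size: int = BLOCK_SIZE) -> bool:
    """Check that both the file offset and the IO length are block aligned."""
    return offset % block_size == 0 and length % block_size == 0


def make_direct_io_buffer(payload: bytes, block_size: int = BLOCK_SIZE) -> mmap.mmap:
    """Copy payload into a zero-padded anonymous mapping.

    Anonymous mappings start on a page boundary, which is only sufficient
    when the block size does not exceed the page size.
    """
    size = round_up(max(len(payload), 1), block_size)
    buf = mmap.mmap(-1, size)
    buf[:len(payload)] = payload
    return buf


buf = make_direct_io_buffer(b"hello btrfs")
print(len(buf), is_aligned(0, len(buf)))
```

Such a buffer could then be passed to `os.write()` on a descriptor opened with `os.O_DIRECT`. The file itself must not require data checksums, for example because it was created on a filesystem mounted with `nodatasum`, or because `chattr +C` was applied while it was still empty (NOCOW implies no data checksums).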