[RFC,0/3] Btrfs checksum offload

Message ID 20250129140207.22718-1-joshi.k@samsung.com (mailing list archive)

Message

Kanchan Joshi Jan. 29, 2025, 2:02 p.m. UTC
TL;DR first: this makes Btrfs chuck its checksum tree and leverage NVMe
SSD for data checksumming.

Now, the longer version for why/how.

End-to-end data protection (E2EDP)-capable drives require the transfer
of integrity metadata (PI).
This is currently handled by the block layer, without filesystem
involvement/awareness.
The block layer attaches the metadata buffer, generates the checksum
(and reftag) for write I/O, and verifies it during read I/O.
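
For readers less familiar with that auto-PI path, here is a minimal
user-space model of the Type 1 generate pass. The real code lives in
block/t10-pi.c; this sketch only mirrors its logic and is not the
kernel implementation.

  #include <stddef.h>
  #include <stdint.h>
  #include <arpa/inet.h>	/* htons/htonl */

  /* On-the-wire PI tuple for 16-bit-guard formats (mirrors
   * struct t10_pi_tuple in include/linux/t10-pi.h). */
  struct pi_tuple {
  	uint16_t guard_tag;	/* CRC of the data interval, big-endian */
  	uint16_t app_tag;	/* opaque to the device */
  	uint32_t ref_tag;	/* low 32 bits of the LBA, big-endian (Type 1) */
  };

  /* T10-DIF CRC16: polynomial 0x8BB7, no reflection, zero init. */
  static uint16_t t10dif_crc(const uint8_t *buf, size_t len)
  {
  	uint16_t crc = 0;

  	while (len--) {
  		crc ^= (uint16_t)*buf++ << 8;
  		for (int k = 0; k < 8; k++)
  			crc = (crc & 0x8000) ? (crc << 1) ^ 0x8BB7
  					     : crc << 1;
  	}
  	return crc;
  }

  /* Generate one PI tuple per protection interval of a write. */
  static void pi_generate(const uint8_t *data, size_t interval,
  			  size_t nr_intervals, uint64_t seed_lba,
  			  struct pi_tuple *pi)
  {
  	for (size_t i = 0; i < nr_intervals; i++) {
  		pi[i].guard_tag =
  			htons(t10dif_crc(data + i * interval, interval));
  		pi[i].app_tag = 0;
  		pi[i].ref_tag = htonl((uint32_t)(seed_lba + i));
  	}
  }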

Btrfs has its own data and metadata checksumming, which is currently
disconnected from the above.
It maintains a separate on-device 'checksum tree' for data checksums,
while the block layer will also be checksumming each Btrfs I/O.

There is value in avoiding Copy-on-write (COW) checksum tree on
a device that can anyway store checksums inline (as part of PI).
This would eliminate extra checksum writes/reads, making I/O
more CPU-efficient.
Additionally, usable space would increase, and write
amplification, both in Btrfs and eventually at the device level, would
be reduced [*].

NVMe drives can also automatically insert and strip the PI/checksum
and provide a per-I/O control knob (the PRACT bit) for this.
Block layer currently makes no attempt to know/advertise this offload.
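
For reference, PRACT is bit 13 of the PRINFO field in the NVMe rw
command's control word. A minimal driver-side sketch of setting it per
I/O follows; NVME_RW_PRINFO_PRACT is the real flag from
include/linux/nvme.h, while the helper itself is hypothetical:

  #include <linux/nvme.h>

  /* Hypothetical helper: with PRACT set, the device inserts PI on
   * writes and strips/verifies it on reads, so the host never
   * touches the checksum. */
  static void nvme_set_pract(struct nvme_command *cmnd, bool offload)
  {
  	u16 control = le16_to_cpu(cmnd->rw.control);

  	if (offload)
  		control |= NVME_RW_PRINFO_PRACT;
  	cmnd->rw.control = cpu_to_le16(control);
  }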

This patch series: (a) adds checksum offload awareness to the
block layer (patch #1),
(b) enables the NVMe driver to register and support the offload
(patch #2), and
(c) introduces an opt-in (datasum_offload mount option) in Btrfs to
apply checksum offload for data (patch #3).
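
A rough sketch of the shape of the Btrfs opt-in in (c); the
REQ_INTEGRITY_OFFLOAD flag name is the one this series proposes, while
the helper and the DATASUM_OFFLOAD mount-option bit below are
illustrative assumptions, not the actual patch:

  /* Assumed wiring: data bios opt into the offload when the
   * datasum_offload mount option is set; metadata keeps Btrfs's
   * own checksums. */
  static void btrfs_maybe_offload_csum(struct btrfs_fs_info *fs_info,
  				       struct bio *bio)
  {
  	if (btrfs_test_opt(fs_info, DATASUM_OFFLOAD) &&
  	    !(bio->bi_opf & REQ_META))
  		bio->bi_opf |= REQ_INTEGRITY_OFFLOAD;
  }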

[*] Here are some perf/write-amplification numbers from randwrite test [1]
on 3 configs (same device):
Config 1: No meta format (4K) + Btrfs (base)
Config 2: Meta format (4K + 8b) + Btrfs (base)
Config 3: Meta format (4K + 8b) + Btrfs (datasum_offload)

In configs 1 and 2, Btrfs operates with a checksum tree.
Only in config 2 does the block layer attach an integrity buffer to each
I/O and perform checksum/reftag verification.
Only in config 3 does the offload take place, with the device
generating/verifying the checksum.

AppW: writes issued by app, 120G (4 Jobs, each writing 30G)
FsW: writes issued to device (from iostat)
ExtraW: extra writes compared to AppW

Direct I/O
---------------------------------------------------------
Config		IOPS(K)		FsW(G)		ExtraW(G)
1		144		186		66
2		141		181		61
3		172		129		9

Buffered I/O
---------------------------------------------------------
Config		IOPS(K)		FsW(G)		ExtraW(G)
1		82		255		135
2		80		181		132
3		100		199		79

Write amplification is generally high (and that's understandable given
B-trees), but I am not sure why buffered I/O shows that much.

[1] fio --name=btrfswrite --ioengine=io_uring --directory=/mnt --blocksize=4k --readwrite=randwrite --filesize=30G --numjobs=4 --iodepth=32 --randseed=0 --direct=1 --output=out --group_reporting


Kanchan Joshi (3):
  block: add integrity offload
  nvme: support integrity offload
  btrfs: add checksum offload

 block/bio-integrity.c     | 42 ++++++++++++++++++++++++++++++++++++++-
 block/t10-pi.c            |  7 +++++++
 drivers/nvme/host/core.c  | 24 ++++++++++++++++++++++
 drivers/nvme/host/nvme.h  |  1 +
 fs/btrfs/bio.c            | 12 +++++++++++
 fs/btrfs/fs.h             |  1 +
 fs/btrfs/super.c          |  9 +++++++++
 include/linux/blk_types.h |  3 +++
 include/linux/blkdev.h    |  7 +++++++
 9 files changed, 105 insertions(+), 1 deletion(-)

Comments

Johannes Thumshirn Jan. 29, 2025, 2:55 p.m. UTC | #1
On 29.01.25 15:13, Kanchan Joshi wrote:
> TL;DR first: this makes Btrfs chuck its checksum tree and leverage NVMe
> SSD for data checksumming.
> 
> Now, the longer version for why/how.
> 
> End-to-end data protection (E2EDP)-capable drives require the transfer
> of integrity metadata (PI).
> This is currently handled by the block layer, without filesystem
> involvement/awareness.
> The block layer attaches the metadata buffer, generates the checksum
> (and reftag) for write I/O, and verifies it during read I/O.
> 
> Btrfs has its own data and metadata checksumming, which is currently
> disconnected from the above.
> It maintains a separate on-device 'checksum tree' for data checksums,
> while the block layer will also be checksumming each Btrfs I/O.
> 
> There is value in avoiding Copy-on-write (COW) checksum tree on
> a device that can anyway store checksums inline (as part of PI).
> This would eliminate extra checksum writes/reads, making I/O
> more CPU-efficient.
> Additionally, usable space would increase, and write
> amplification, both in Btrfs and eventually at the device level, would
> be reduced [*].
> 
> NVMe drives can also automatically insert and strip the PI/checksum
> and provide a per-I/O control knob (the PRACT bit) for this.
> Block layer currently makes no attempt to know/advertise this offload.
> 
> This patch series: (a) adds checksum offload awareness to the
> block layer (patch #1),
> (b) enables the NVMe driver to register and support the offload
> (patch #2), and
> (c) introduces an opt-in (datasum_offload mount option) in Btrfs to
> apply checksum offload for data (patch #3).

Hi Kanchan,

This is an interesting approach to offloading the checksum work. I've
only had a quick glance over it from a bird's-eye view, and one thing
I noticed is the missing connection of error reporting between the layers.

For instance, if we get a checksum error on btrfs, we not only report it
in dmesg but also try to repair the affected sector if we have a data
profile with redundancy.

So while this patchset offloads the submission-side work of the checksum
tree to the PI code, I don't see the back-propagation of the errors into
btrfs and the triggering of the repair code.

I get that it's an RFC, but as it is now it essentially breaks functionality
we rely on. Can you add this part as well, so we can evaluate the
patchset not only from the write side but also from the read side?

Byte,
	Johannes
Keith Busch Jan. 29, 2025, 3:28 p.m. UTC | #2
On Wed, Jan 29, 2025 at 07:32:04PM +0530, Kanchan Joshi wrote:
> There is value in avoiding Copy-on-write (COW) checksum tree on
> a device that can anyway store checksums inline (as part of PI).
> This would eliminate extra checksum writes/reads, making I/O
> more CPU-efficient.

Another potential benefit: if the device does the checksum, then I think
btrfs could avoid the stable page writeback overhead and let the
contents be changeable all the way until it goes out on the wire.

Though I feel the very specific device format constraints that can
support an offload like this are unfortunate.
Christoph Hellwig Jan. 29, 2025, 3:35 p.m. UTC | #3
On Wed, Jan 29, 2025 at 07:32:04PM +0530, Kanchan Joshi wrote:
> End-to-end data protection (E2EDP)-capable drives require the transfer
> of integrity metadata (PI).
> This is currently handled by the block layer, without filesystem
> involvement/awareness.
> The block layer attaches the metadata buffer, generates the checksum
> (and reftag) for write I/O, and verifies it during read I/O.

That's not quite true.  The block layer automatically adds a PI
payload if none is added by the caller.  The caller can add its own
PI payload, but currently no file system does this - only the block
device fops as of 6.13 and the nvme and scsi targets.  But file systems
can do that, and I have (hacky and outdated) patches wiring this up
in XFS.

Note that the "auto-PI" vs "caller-PI" isn't very cleanly split
currently, which causes some confusion.  I have a series almost
ready to clean that up a bit.

> There is value in avoiding Copy-on-write (COW) checksum tree on
> a device that can anyway store checksums inline (as part of PI).

Yes.

> This patch series: (a) adds checksum offload awareness to the
> block layer (patch #1),

I've skipped over the patches and don't understand what this offload
awareness concept does compared to the file system simply attaching PI
metadata.

> (c) introduces an opt-in (datasum_offload mount option) in Btrfs to
> apply checksum offload for data (patch #3).

Not really important for an initial prototype, but incompatible on-disk
format changes like this need feature flags and not just a mount
option.
Christoph Hellwig Jan. 29, 2025, 3:40 p.m. UTC | #4
On Wed, Jan 29, 2025 at 08:28:24AM -0700, Keith Busch wrote:
> On Wed, Jan 29, 2025 at 07:32:04PM +0530, Kanchan Joshi wrote:
> > There is value in avoiding Copy-on-write (COW) checksum tree on
> > a device that can anyway store checksums inline (as part of PI).
> > This would eliminate extra checksum writes/reads, making I/O
> > more CPU-efficient.
> 
> Another potential benefit: if the device does the checksum, then I think
> btrfs could avoid the stable page writeback overhead and let the
> contents be changeable all the way until it goes out on the wire.

If the device generates the checksum (aka DIF insert) that problem goes
away.  But we also lose integrity protection over the wire, which would
be unfortunate.

If you feed the checksum / guard tag from the kernel we still have the
same problem.  A while ago I did a prototype where we'd bubble up to the
fs that we had a guard tag error vs. just the non-specific "protection
error", and the file system would then retry after copying.  This was
pretty sketchy, as the error handling blew up frequently, and at least my
version would only work for synchronous I/O and not with aio / io_uring
due to the missing MM context.  But if someone has enough spare cycles,
that could be something interesting to look into again.
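
To make that concrete, here is a rough, hypothetical shape of such a
retry path; BLK_STS_GUARD is invented (today only the generic
BLK_STS_PROTECTION is reported), and the fs_* helpers are placeholders:

  /* Hypothetical end_io: on a guard-tag mismatch, assume the page
   * changed while in flight, snapshot it so data and PI are consistent
   * again, and resubmit (retry accounting omitted for brevity). */
  static void fs_write_end_io(struct bio *bio)
  {
  	if (bio->bi_status == BLK_STS_GUARD) {
  		struct bio *clone = fs_bounce_clone(bio); /* placeholder */

  		submit_bio(clone);
  		return;
  	}
  	fs_complete_write(bio); /* placeholder */
  }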
Mark Harmstone Jan. 29, 2025, 3:55 p.m. UTC | #5
On 29/1/25 14:02, Kanchan Joshi wrote:
> TL;DR first: this makes Btrfs chuck its checksum tree and leverage NVMe
> SSD for data checksumming.
> 
> [...]
> 
> This patch series: (a) adds checksum offload awareness to the
> block layer (patch #1),
> (b) enables the NVMe driver to register and support the offload
> (patch #2), and
> (c) introduces an opt-in (datasum_offload mount option) in Btrfs to
> apply checksum offload for data (patch #3).

There's also checksumming done on the metadata trees, which could be 
avoided if we're trusting the block device to do it.

Maybe rather than putting this behind a new compat flag, add a new csum 
type of "none"? With the logic being that it also zeroes out the csum 
field in the B-tree headers.
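
One possible shape of that suggestion; the first four values match the
existing Btrfs on-disk checksum types, and the NONE entry is the
hypothetical addition:

  #define BTRFS_CSUM_TYPE_CRC32	0
  #define BTRFS_CSUM_TYPE_XXHASH	1
  #define BTRFS_CSUM_TYPE_SHA256	2
  #define BTRFS_CSUM_TYPE_BLAKE2	3
  /* Proposed: checksums delegated to the device; csum fields,
   * including those in B-tree headers, are written as zeroes. */
  #define BTRFS_CSUM_TYPE_NONE	4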

Mark
Keith Busch Jan. 29, 2025, 6:03 p.m. UTC | #6
On Wed, Jan 29, 2025 at 04:40:25PM +0100, Christoph Hellwig wrote:
> > 
> > Another potential benefit: if the device does the checksum, then I think
> > btrfs could avoid the stable page writeback overhead and let the
> > contents be changeable all the way until it goes out on the wire.
> 
> If the device generates the checksum (aka DIF insert) that problem goes
> away.  But we also lose integrity protection over the wire, which would
> be unfortunate.

If the "wire" is only PCIe, I don't see why it matters. What kind of
wire corruption gets undetected by the protocol's encoding and LCRC that
would get caught by the host's CRC payload?
Goffredo Baroncelli Jan. 29, 2025, 7:02 p.m. UTC | #7
On 29/01/2025 15.02, Kanchan Joshi wrote:
> TL;DR first: this makes Btrfs chuck its checksum tree and leverage NVMe
> SSD for data checksumming.
> 
> Now, the longer version for why/how.
> 
> End-to-end data protection (E2EDP)-capable drives require the transfer
> of integrity metadata (PI).
> This is currently handled by the block layer, without filesystem
> involvement/awareness.
> The block layer attaches the metadata buffer, generates the checksum
> (and reftag) for write I/O, and verifies it during read I/O.
> 
Maybe this is a stupid question, but if we can (will) avoid storing the checksum
in the FS, what is the advantage of having a COW filesystem?

My understanding is that a COW filesystem is needed mainly to synchronize
csum and data. Am I wrong?

[...]

BR
Kanchan Joshi Jan. 30, 2025, 9:22 a.m. UTC | #8
On 1/29/2025 9:05 PM, Christoph Hellwig wrote:
>> This patch series: (a) adds checksum offload awareness to the
>> block layer (patch #1),
> I've skipped over the patches and don't understand what this offload
> awareness concept does compared to the file system simply attaching PI
> metadata.

The difference is that the FS does not have to attach any PI for offload.

Offload is about the host doing as little as possible, and the closest
we get there is by setting the PRACT bit.

Attaching PI is not really needed, either in the FS or in the block
layer, for a pure offload.
When the device has an "ms == pi_size" format, we only need to send I/O
with PRACT set, and the device takes care of attaching the integrity
buffer and generating/verifying the checksum.
This is abstracted as 'offload type 1' in this series.

For the other format, "ms > pi_size", we also set PRACT, but an
integrity buffer needs to be passed as well. This is abstracted as
'offload type 2'. It is still an offload, as the checksum processing is
done only by the device.

Block layer auto-PI is a good place for this because all the above
details are common and remain abstracted, while filesystems only need to
decide whether to send the flag (REQ_INTEGRITY_OFFLOAD) to use the
facility.
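
Restated as code, the split looks roughly like this (an illustrative
helper, not code from the series; ms is the metadata bytes per
protection interval, pi_size the bytes of PI the device checks):

  static int integrity_offload_type(unsigned int ms, unsigned int pi_size)
  {
  	if (!ms)
  		return 0;	/* no metadata: nothing to offload */
  	if (ms == pi_size)
  		return 1;	/* type 1: PRACT only, no host buffer */
  	return 2;		/* type 2: PRACT, buffer still transferred */
  }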
Daniel Vacek Jan. 30, 2025, 9:33 a.m. UTC | #9
On Wed, 29 Jan 2025 at 20:04, Goffredo Baroncelli <kreijack@libero.it> wrote:
>
> On 29/01/2025 15.02, Kanchan Joshi wrote:
> > TL;DR first: this makes Btrfs chuck its checksum tree and leverage NVMe
> > SSD for data checksumming.
> >
> > Now, the longer version for why/how.
> >
> > End-to-end data protection (E2EDP)-capable drives require the transfer
> > of integrity metadata (PI).
> > This is currently handled by the block layer, without filesystem
> > involvement/awareness.
> > The block layer attaches the metadata buffer, generates the checksum
> > (and reftag) for write I/O, and verifies it during read I/O.
> >
> Maybe this is a stupid question, but if we can (will) avoid storing the checksum
> in the FS, what is the advantage of having a COW filesystem?

I was wondering the same. My understanding is the checksums are there
primarily to protect against untrusted devices or data transfers over
the line. And now suddenly we're going to trust them? What's even the
point then?

Is there any other advantage of having these checksums that I may be missing?
Perhaps logic bugs in code accidentally corrupting the data? Is the
stored payload ever even touched? That would not be wanted, right?
Or perhaps data mangled on the storage by an attacker?

> My understanding is that a COW filesystem is needed mainly to synchronize
> csum and data. Am I wrong?
>
> [...]
>
> BR
Christoph Hellwig Jan. 30, 2025, 12:53 p.m. UTC | #10
On Thu, Jan 30, 2025 at 02:52:23PM +0530, Kanchan Joshi wrote:
> On 1/29/2025 9:05 PM, Christoph Hellwig wrote:
> >> This patch series: (a) adds checksum offload awareness to the
> >> block layer (patch #1),
> > I've skipped over the patches and don't understand what this offload
> > awareness concept does compared to the file system simply attaching PI
> > metadata.
> 
> The difference is that the FS does not have to attach any PI for offload.
> 
> Offload is about the host doing as little as possible, and the closest
> we get there is by setting the PRACT bit.

But that doesn't actually work.  The file system needs to be able
to verify the checksum for failing over to other mirrors, repair,
etc.  Also, if you trust the device to get things right, you do not
need to use PI at all - SSDs or hard drives that support PI generally
use PI internally anyway, and PRACT just means you treat a format
with PI like one without.  In other words - no need for an offload
here; you might as well just trust the device if you're not doing
end-to-end protection.

Christoph Hellwig Jan. 30, 2025, 12:54 p.m. UTC | #11
On Wed, Jan 29, 2025 at 11:03:36AM -0700, Keith Busch wrote:
> > away.  But we also lose integrity protection over the wire, which would
> > be unfortunate.
> 
> If the "wire" is only PCIe, I don't see why it matters. What kind of
> wire corruption gets undetected by the protocol's encoding and LCRC that
> would get caught by the host's CRC payload?

The "wire" could be anything.  And includes a little more than than
than the wire, like the entire host side driver stack and the device
data path between the phy and wherever in the stack the PI insert/strip
accelerator sits.
Martin K. Petersen Jan. 30, 2025, 8:21 p.m. UTC | #12
Hi Kanchan!

> There is value in avoiding Copy-on-write (COW) checksum tree on a
> device that can anyway store checksums inline (as part of PI). This
> would eliminate extra checksum writes/reads, making I/O more
> CPU-efficient. Additionally, usable space would increase, and write
> amplification, both in Btrfs and eventually at the device level, would
> be reduced [*].

I have a couple of observations.

First of all, there is no inherent benefit to PI if it is generated at
the same time as the ECC. The ECC is usually far superior when it comes
to protecting data at rest. And you'll still get an error if uncorrected
corruption is detected. So BLK_INTEGRITY_OFFLOAD_NO_BUF does not offer
any benefits in my book.

The motivation for T10 PI is that it is generated in close temporal
proximity to the data. I.e. ideally the PI protecting the data is
calculated as soon as the data has been created in memory. And then the
I/O will eventually be queued, submitted, traverse the kernel, through
the storage fabric, and out to the end device. The PI and data have
traveled along different paths (potentially, more on that later) to get
there. The device will calculate the ECC and then perform a validation
of the PI wrt. to the data buffer. And if those two line up, we know the
ECC is also good. At that point we have confirmed that the data to be
stored matches the data that was used as input when the PI was generated
N seconds ago in host memory. And therefore we can write.

I.e. the goal of PI is protect against problems that happen between data
creation time and the data being persisted to media. Once the ECC has
been calculated, PI essentially stops being interesting.

The second point I would like to make is that the separation between PI
and data that we introduced with DIX, and which NVMe subsequently
adopted, was a feature. It was not just to avoid the inconvenience of
having to deal with buffers that were multiples of 520 bytes in host
memory. The separation between the data and its associated protection
information had proven critical for data protection in many common
corruption scenarios. Inline protection had been tried and had failed to
catch many of the scenarios we had come across in the field.

At the time T10 PI was designed spinning rust was the only game in town.
And nobody was willing to take the performance hit of having to seek
twice per I/O to store PI separately from the data. And while schemes
involving sending all the PI ahead of the data were entertained, they
never came to fruition. Storing 512+8 in the same sector was a necessity
in the context of SCSI drives, not a desired behavior. Addressing that
in DIX was key.

So to me, it's a highly desirable feature that btrfs stores its
checksums elsewhere on media. But that's obviously a trade-off a user
can make. In some cases media WAR may be more important than extending
the protection envelope for the data, and that's OK. I would suggest you
look at using CRC32C given the intended 4KB block use case, though,
because the 16-bit CRC isn't fantastic for large blocks.
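
To illustrate that last point, here are the two guard flavours side by
side, using the kernel's CRC helpers (the wrappers and seed handling
are simplified for illustration). A 16-bit guard leaves roughly a
1-in-2^16 floor for undetected random corruption of a 4KB interval,
whereas a CRC32C guard lowers that floor to about 1 in 2^32:

  #include <linux/crc-t10dif.h>
  #include <linux/crc32c.h>

  static u16 guard16(const void *data, size_t len)
  {
  	return crc_t10dif(data, len);	/* T10-DIF CRC16, poly 0x8BB7 */
  }

  static u32 guard32(u32 seed, const void *data, size_t len)
  {
  	return crc32c(seed, data, len);	/* CRC32C (Castagnoli) */
  }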