Message ID: 20250129140207.22718-1-joshi.k@samsung.com
Series: Btrfs checksum offload
On 29.01.25 15:13, Kanchan Joshi wrote:
> TL;DR first: this makes Btrfs chuck its checksum tree and leverage NVMe
> SSD for data checksumming.
>
> Now, the longer version for why/how.
>
> End-to-end data protection (E2EDP)-capable drives require the transfer
> of integrity metadata (PI).
> This is currently handled by the block layer, without filesystem
> involvement/awareness.
> The block layer attaches the metadata buffer, generates the checksum
> (and reftag) for write I/O, and verifies it during read I/O.
>
> Btrfs has its own data and metadata checksumming, which is currently
> disconnected from the above.
> It maintains a separate on-device 'checksum tree' for data checksums,
> while the block layer will also be checksumming each Btrfs I/O.
>
> There is value in avoiding Copy-on-write (COW) checksum tree on
> a device that can anyway store checksums inline (as part of PI).
> This would eliminate extra checksum writes/reads, making I/O
> more CPU-efficient.
> Additionally, usable space would increase, and write
> amplification, both in Btrfs and eventually at the device level, would
> be reduced [*].
>
> NVMe drives can also automatically insert and strip the PI/checksum
> and provide a per-I/O control knob (the PRACT bit) for this.
> Block layer currently makes no attempt to know/advertise this offload.
>
> This patch series: (a) adds checksum offload awareness to the
> block layer (patch #1),
> (b) enables the NVMe driver to register and support the offload
> (patch #2), and
> (c) introduces an opt-in (datasum_offload mount option) in Btrfs to
> apply checksum offload for data (patch #3).

Hi Kanchan,

This is an interesting approach to offloading the checksum work. I've
only had a quick glance over it from a bird's eye view, and one thing I
noticed is the missing connection of error reporting between the layers.

For instance, if we get a checksum error on btrfs, we not only report it
in dmesg but also try to repair the affected sector if we have a data
profile with redundancy.

So while this patchset offloads the submission-side work of the checksum
tree to the PI code, I don't see the back-propagation of the errors into
btrfs and the triggering of the repair code.

I get that it's an RFC, but as it is now it essentially breaks
functionality we rely on. Can you add this part as well, so we can
evaluate the patchset not only from the write side but also from the
read side?

Byte,
	Johannes
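P.S.: roughly the read-side flow I mean, as a simplified sketch - the
helper names here are made up for illustration, not the actual btrfs
functions:

static void read_end_io(struct btrfs_bio *bbio)
{
	if (data_checksum_mismatch(bbio)) {	/* today: csum-tree lookup + compare */
		report_corruption(bbio);	/* dmesg + device error counters */
		/* with a redundant profile (RAID1/DUP/...) read the other
		 * mirror and rewrite the bad sector with the good copy */
		if (has_other_mirror(bbio))
			repair_from_mirror(bbio);
	}
	/* With PRACT offload the mismatch surfaces as a guard-check error
	 * from the device instead; that status would need to be mapped
	 * back into this same path. */
}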
On Wed, Jan 29, 2025 at 07:32:04PM +0530, Kanchan Joshi wrote:
> There is value in avoiding Copy-on-write (COW) checksum tree on
> a device that can anyway store checksums inline (as part of PI).
> This would eliminate extra checksum writes/reads, making I/O
> more CPU-efficient.

Another potential benefit: if the device does the checksum, then I think
btrfs could avoid the stable page writeback overhead and let the
contents be changeable all the way until it goes out on the wire.

Though I feel the very specific device format constraints that can
support an offload like this are unfortunate.
On Wed, Jan 29, 2025 at 07:32:04PM +0530, Kanchan Joshi wrote:
> End-to-end data protection (E2EDP)-capable drives require the transfer
> of integrity metadata (PI).
> This is currently handled by the block layer, without filesystem
> involvement/awareness.
> The block layer attaches the metadata buffer, generates the checksum
> (and reftag) for write I/O, and verifies it during read I/O.

That's not quite true.  The block layer automatically adds a PI payload
if none is added by the caller.  The caller can add its own PI payload,
but currently no file system does this - only the block device fops as
of 6.13, and the nvme and scsi targets.  But file systems can do that,
and I have (hacky and outdated) patches wiring this up in XFS.

Note that the "auto-PI" vs "caller-PI" split isn't very clean currently,
which causes some confusion.  I have a series almost ready that cleans
that up a bit.

> There is value in avoiding Copy-on-write (COW) checksum tree on
> a device that can anyway store checksums inline (as part of PI).

Yes.

> This patch series: (a) adds checksum offload awareness to the
> block layer (patch #1),

I've skipped over the patches and don't understand what this offload
awareness concept does compared to the file system simply attaching PI
metadata.

> (c) introduces an opt-in (datasum_offload mount option) in Btrfs to
> apply checksum offload for data (patch #3).

Not really important for an initial prototype, but incompatible on-disk
format changes like this need feature flags and not just a mount option.
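For reference, "caller-PI" roughly means the file system doing something
like the below before submission (hand-wavy sketch with error handling
trimmed and a single 4KB interval assumed; not the actual XFS patches).
Once a bio carries its own integrity payload, auto-PI leaves it alone:

#include <linux/bio.h>
#include <linux/crc-t10dif.h>
#include <linux/t10-pi.h>
#include <linux/slab.h>

static int fs_attach_pi(struct bio *bio, void *data)
{
	struct bio_integrity_payload *bip;
	struct t10_pi_tuple *pi;

	pi = kzalloc(sizeof(*pi), GFP_NOIO);
	if (!pi)
		return -ENOMEM;
	/* guard tag: 16-bit T10 CRC over the 4KB data interval */
	pi->guard_tag = cpu_to_be16(crc_t10dif(data, 4096));

	bip = bio_integrity_alloc(bio, GFP_NOIO, 1);
	if (IS_ERR(bip)) {
		kfree(pi);
		return PTR_ERR(bip);
	}
	if (bio_integrity_add_page(bio, virt_to_page(pi), sizeof(*pi),
				   offset_in_page(pi)) != sizeof(*pi))
		return -EIO;
	return 0;
}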
On Wed, Jan 29, 2025 at 08:28:24AM -0700, Keith Busch wrote:
> On Wed, Jan 29, 2025 at 07:32:04PM +0530, Kanchan Joshi wrote:
> > There is value in avoiding Copy-on-write (COW) checksum tree on
> > a device that can anyway store checksums inline (as part of PI).
> > This would eliminate extra checksum writes/reads, making I/O
> > more CPU-efficient.
>
> Another potential benefit: if the device does the checksum, then I think
> btrfs could avoid the stable page writeback overhead and let the
> contents be changeable all the way until it goes out on the wire.

If the device generates the checksum (aka DIF insert) that problem goes
away.  But we also lose integrity protection over the wire, which would
be unfortunate.

If you feed the checksum / guard tag from the kernel we still have the
same problem.  A while ago I did a prototype where we'd bubble up to the
fs that we had a guard tag error vs just the non-specific "protection
error", and the file system would then retry after copying.  This was
pretty sketchy as the error handling blew up frequently, and at least my
version would only work for synchronous I/O and not with aio / io_uring
due to the missing MM context.  But if someone has enough spare cycles,
that could be something interesting to look into again.
On 29/1/25 14:02, Kanchan Joshi wrote:
>
> TL;DR first: this makes Btrfs chuck its checksum tree and leverage NVMe
> SSD for data checksumming.
>
> Now, the longer version for why/how.
>
> End-to-end data protection (E2EDP)-capable drives require the transfer
> of integrity metadata (PI).
> This is currently handled by the block layer, without filesystem
> involvement/awareness.
> The block layer attaches the metadata buffer, generates the checksum
> (and reftag) for write I/O, and verifies it during read I/O.
>
> Btrfs has its own data and metadata checksumming, which is currently
> disconnected from the above.
> It maintains a separate on-device 'checksum tree' for data checksums,
> while the block layer will also be checksumming each Btrfs I/O.
>
> There is value in avoiding Copy-on-write (COW) checksum tree on
> a device that can anyway store checksums inline (as part of PI).
> This would eliminate extra checksum writes/reads, making I/O
> more CPU-efficient.
> Additionally, usable space would increase, and write
> amplification, both in Btrfs and eventually at the device level, would
> be reduced [*].
>
> NVMe drives can also automatically insert and strip the PI/checksum
> and provide a per-I/O control knob (the PRACT bit) for this.
> Block layer currently makes no attempt to know/advertise this offload.
>
> This patch series: (a) adds checksum offload awareness to the
> block layer (patch #1),
> (b) enables the NVMe driver to register and support the offload
> (patch #2), and
> (c) introduces an opt-in (datasum_offload mount option) in Btrfs to
> apply checksum offload for data (patch #3).
>
> [*] Here are some perf/write-amplification numbers from randwrite test [1]
> on 3 configs (same device):
> Config 1: No meta format (4K) + Btrfs (base)
> Config 2: Meta format (4K + 8b) + Btrfs (base)
> Config 3: Meta format (4K + 8b) + Btrfs (datasum_offload)
>
> In config 1 and 2, Btrfs will operate with a checksum tree.
> Only in config 2, block-layer will attach integrity buffer with each I/O and
> do checksum/reftag verification.
> Only in config 3, offload will take place and device will generate/verify
> the checksum.
>
> AppW: writes issued by app, 120G (4 Jobs, each writing 30G)
> FsW: writes issued to device (from iostat)
> ExtraW: extra writes compared to AppW
>
> Direct I/O
> ---------------------------------------------------------
> Config    IOPS(K)    FsW(G)    ExtraW(G)
> 1         144        186       66
> 2         141        181       61
> 3         172        129       9
>
> Buffered I/O
> ---------------------------------------------------------
> Config    IOPS(K)    FsW(G)    ExtraW(G)
> 1         82         255       135
> 2         80         181       132
> 3         100        199       79
>
> Write amplification is generally high (and that's understandable given
> B-trees) but not sure why buffered I/O shows that much.
>
> [1] fio --name=btrfswrite --ioengine=io_uring --directory=/mnt --blocksize=4k --readwrite=randwrite --filesize=30G --numjobs=4 --iodepth=32 --randseed=0 --direct=1 --output=out --group_reporting
>
> Kanchan Joshi (3):
>   block: add integrity offload
>   nvme: support integrity offload
>   btrfs: add checksum offload
>
>  block/bio-integrity.c     | 42 ++++++++++++++++++++++++++++++++++++++-
>  block/t10-pi.c            |  7 +++++++
>  drivers/nvme/host/core.c  | 24 ++++++++++++++++++++++
>  drivers/nvme/host/nvme.h  |  1 +
>  fs/btrfs/bio.c            | 12 +++++++++++
>  fs/btrfs/fs.h             |  1 +
>  fs/btrfs/super.c          |  9 +++++++++
>  include/linux/blk_types.h |  3 +++
>  include/linux/blkdev.h    |  7 +++++++
>  9 files changed, 105 insertions(+), 1 deletion(-)

There's also checksumming done on the metadata trees, which could be
avoided if we're trusting the block device to do it.

Maybe rather than putting this behind a new compat flag, add a new csum
type of "none"?  With the logic being that it also zeroes out the csum
field in the B-tree headers.

Mark
On Wed, Jan 29, 2025 at 04:40:25PM +0100, Christoph Hellwig wrote:
> > Another potential benefit: if the device does the checksum, then I think
> > btrfs could avoid the stable page writeback overhead and let the
> > contents be changeable all the way until it goes out on the wire.
>
> If the device generates the checksum (aka DIF insert) that problem goes
> away.  But we also lose integrity protection over the wire, which would
> be unfortunate.

If the "wire" is only PCIe, I don't see why it matters. What kind of
wire corruption goes undetected by the protocol's encoding and LCRC that
would get caught by the host's CRC payload?
On 29/01/2025 15.02, Kanchan Joshi wrote:
> TL;DR first: this makes Btrfs chuck its checksum tree and leverage NVMe
> SSD for data checksumming.
>
> Now, the longer version for why/how.
>
> End-to-end data protection (E2EDP)-capable drives require the transfer
> of integrity metadata (PI).
> This is currently handled by the block layer, without filesystem
> involvement/awareness.
> The block layer attaches the metadata buffer, generates the checksum
> (and reftag) for write I/O, and verifies it during read I/O.

Maybe this is a stupid question, but if we can (will) avoid storing the
checksum in the FS, what is the advantage of having a COW filesystem?

My understanding is that a COW filesystem is needed mainly to synchronize
csum and data. Am I wrong?

[...]

BR
On 1/29/2025 9:05 PM, Christoph Hellwig wrote:
>> This patch series: (a) adds checksum offload awareness to the
>> block layer (patch #1),
> I've skipped over the patches and don't understand what this offload
> awareness concept does compared to the file system simply attaching PI
> metadata.

The difference is that the FS does not have to attach any PI for offload.

Offload is about the host doing as little as possible, and the closest
we get there is by setting the PRACT bit.

Attaching PI is not really needed, neither for the FS nor for the block
layer, for a pure offload.
When the device has a "ms == pi_size" format, we only need to send the
I/O with PRACT set, and the device takes care of attaching the integrity
buffer and generating/verifying the checksum.
This is abstracted as 'offload type 1' in this series.

For the other format, "ms > pi_size", we also set PRACT, but an
integrity buffer needs to be passed as well. This is abstracted as
'offload type 2'. It is still an offload, as the checksum processing is
done only by the device.

Block layer auto-PI is a good place for this because all the above
details are common and remain abstracted, while filesystems only need to
decide whether they want to send the flag (REQ_INTEGRITY_OFFLOAD) to use
the facility.
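At the NVMe command level the two types reduce to something like this
(a sketch following the description above; NVME_RW_PRINFO_PRACT is the
real PRACT bit, the rest is illustrative rather than the actual patch
code):

#include <linux/nvme.h>

static void setup_offload(struct nvme_command *cmnd, unsigned int ms,
			  unsigned int pi_size)
{
	/* PRACT = 1: controller generates PI on write, verifies on read */
	cmnd->rw.control |= cpu_to_le16(NVME_RW_PRINFO_PRACT);

	if (ms == pi_size) {
		/* Type 1: no integrity buffer is transferred at all;
		 * the device inserts/strips the PI internally. */
	} else {
		/* Type 2 (ms > pi_size): an integrity buffer still
		 * travels with the I/O, but checksum generation and
		 * verification stay on the device. */
	}
}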
On Wed, 29 Jan 2025 at 20:04, Goffredo Baroncelli <kreijack@libero.it> wrote:
>
> On 29/01/2025 15.02, Kanchan Joshi wrote:
> > TL;DR first: this makes Btrfs chuck its checksum tree and leverage NVMe
> > SSD for data checksumming.
> >
> > Now, the longer version for why/how.
> >
> > End-to-end data protection (E2EDP)-capable drives require the transfer
> > of integrity metadata (PI).
> > This is currently handled by the block layer, without filesystem
> > involvement/awareness.
> > The block layer attaches the metadata buffer, generates the checksum
> > (and reftag) for write I/O, and verifies it during read I/O.
>
> Maybe this is a stupid question, but if we can (will) avoid storing the
> checksum in the FS, what is the advantage of having a COW filesystem?

I was wondering the same. My understanding is that the checksums are
there primarily to protect against untrusted devices or data transfers
over the line. And now suddenly we're going to trust them? What's even
the point then?

Is there any other advantage to having these checksums that I may be
missing? Perhaps logic bugs in the code accidentally corrupting the
data? Is the stored payload ever even touched? That would not be wanted,
right? Or perhaps data mangled on the storage by an attacker?

> My understanding is that a COW filesystem is needed mainly to synchronize
> csum and data. Am I wrong?
>
> [...]
>
> BR
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
On Thu, Jan 30, 2025 at 02:52:23PM +0530, Kanchan Joshi wrote:
> On 1/29/2025 9:05 PM, Christoph Hellwig wrote:
> >> This patch series: (a) adds checksum offload awareness to the
> >> block layer (patch #1),
> > I've skipped over the patches and don't understand what this offload
> > awareness concept does compared to the file system simply attaching PI
> > metadata.
>
> The difference is that the FS does not have to attach any PI for offload.
>
> Offload is about the host doing as little as possible, and the closest
> we get there is by setting the PRACT bit.

But that doesn't actually work.  The file system needs to be able to
verify the checksum for failing over to other mirrors, repair, etc.

Also, if you trust the device to get things right, you do not need to
use PI at all - SSDs or hard drives that support PI generally use PI
internally anyway, and PRACT just means you treat a format with PI like
one without.  In other words - no need for an offload here; you might as
well just trust the device if you're not doing end-to-end protection.

> Attaching PI is not really needed, neither for the FS nor for the block
> layer, for a pure offload.
> When the device has a "ms == pi_size" format, we only need to send the
> I/O with PRACT set, and the device takes care of attaching the integrity
> buffer and generating/verifying the checksum.
> This is abstracted as 'offload type 1' in this series.
>
> For the other format, "ms > pi_size", we also set PRACT, but an
> integrity buffer needs to be passed as well. This is abstracted as
> 'offload type 2'. It is still an offload, as the checksum processing is
> done only by the device.
>
> Block layer auto-PI is a good place for this because all the above
> details are common and remain abstracted, while filesystems only need to
> decide whether they want to send the flag (REQ_INTEGRITY_OFFLOAD) to use
> the facility.
On Wed, Jan 29, 2025 at 11:03:36AM -0700, Keith Busch wrote:
> > away.  But we also lose integrity protection over the wire, which would
> > be unfortunate.
>
> If the "wire" is only PCIe, I don't see why it matters. What kind of
> wire corruption goes undetected by the protocol's encoding and LCRC that
> would get caught by the host's CRC payload?

The "wire" could be anything.  And it includes a little more than just
the wire, like the entire host-side driver stack and the device data
path between the phy and wherever in the stack the PI insert/strip
accelerator sits.
Hi Kanchan!

> There is value in avoiding Copy-on-write (COW) checksum tree on a
> device that can anyway store checksums inline (as part of PI). This
> would eliminate extra checksum writes/reads, making I/O more
> CPU-efficient. Additionally, usable space would increase, and write
> amplification, both in Btrfs and eventually at the device level, would
> be reduced [*].

I have a couple of observations.

First of all, there is no inherent benefit to PI if it is generated at
the same time as the ECC. The ECC is usually far superior when it comes
to protecting data at rest. And you'll still get an error if uncorrected
corruption is detected. So BLK_INTEGRITY_OFFLOAD_NO_BUF does not offer
any benefits in my book.

The motivation for T10 PI is that it is generated in close temporal
proximity to the data. I.e. ideally the PI protecting the data is
calculated as soon as the data has been created in memory. And then the
I/O will eventually be queued, submitted, traverse the kernel, go
through the storage fabric, and reach the end device. The PI and the
data have traveled along different paths (potentially, more on that
later) to get there. The device will calculate the ECC and then perform
a validation of the PI wrt. the data buffer. And if those two line up,
we know the ECC is also good. At that point we have confirmed that the
data to be stored matches the data that was used as input when the PI
was generated N seconds ago in host memory. And therefore we can write.

I.e. the goal of PI is to protect against problems that happen between
data creation time and the data being persisted to media. Once the ECC
has been calculated, PI essentially stops being interesting.

The second point I would like to make is that the separation between PI
and data that we introduced with DIX, and which NVMe subsequently
adopted, was a feature. It was not just there to avoid the inconvenience
of having to deal with buffers that were multiples of 520 bytes in host
memory. The separation between the data and its associated protection
information had proven critical for data protection in many common
corruption scenarios. Inline protection had been tried and had failed to
catch many of the scenarios we had come across in the field.

At the time T10 PI was designed, spinning rust was the only game in
town. And nobody was willing to take the performance hit of having to
seek twice per I/O to store PI separately from the data. And while
schemes involving sending all the PI ahead of the data were entertained,
they never came to fruition. Storing 512+8 in the same sector was a
necessity in the context of SCSI drives, not a desired behavior.
Addressing that in DIX was key.

So to me, it's a highly desirable feature that btrfs stores its
checksums elsewhere on the media. But that's obviously a trade-off a
user can make. In some cases reducing media write amplification may be
more important than extending the protection envelope for the data, and
that's OK.

I would suggest you look at using CRC32C given the intended 4KB block
use case, though, because the 16-bit CRC isn't fantastic for large
blocks.
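For scale, both helpers already exist in the kernel; over a 4KB block a
16-bit guard leaves roughly a 1-in-65536 chance that random corruption
goes undetected, vs about 1 in 2^32 for CRC32C (the seed choice below is
just for illustration):

#include <linux/crc-t10dif.h>
#include <linux/crc32c.h>

static void compare_guards(const void *block)	/* 4096-byte block */
{
	u16 guard16 = crc_t10dif(block, 4096);	/* T10 PI guard tag   */
	u32 guard32 = crc32c(~0, block, 4096);	/* btrfs-style CRC32C */

	(void)guard16;
	(void)guard32;
}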