Message ID | 8147ae0a45b9851eacad4e8f5a71b7997c23bdd0.1728071257.git.rgoldwyn@suse.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | btrfs reads through iomap | expand |
On Fri, Oct 04, 2024 at 04:04:33PM -0400, Goldwyn Rodrigues wrote: > Introduce read_inline() function hook for reading inline extents. This > is performed for filesystems such as btrfs which may compress the data > in the inline extents. This feels like an attempt to work around "iomap doesn't support compressed extents" by keeping the decompression in the filesystem, instead of extending iomap to support compressed extents itself. I'd certainly prefer iomap to support compressed extents, but maybe I'm in a minority here.
On Sat, Oct 05, 2024 at 03:30:53AM +0100, Matthew Wilcox wrote: > On Fri, Oct 04, 2024 at 04:04:33PM -0400, Goldwyn Rodrigues wrote: > > Introduce read_inline() function hook for reading inline extents. This > > is performed for filesystems such as btrfs which may compress the data > > in the inline extents. Why don't you set iomap->inline_data to the uncompressed buffer, let readahead copy it to the pagecache, and free it in ->iomap_end? > This feels like an attempt to work around "iomap doesn't support > compressed extents" by keeping the decompression in the filesystem, > instead of extending iomap to support compressed extents itself. > I'd certainly prefer iomap to support compressed extents, but maybe I'm > in a minority here. I'm not an expert on fs compression, but I get the impression that most filesystems handle reads by allocating some folios, reading off the disk into those folios, and decompressing into the pagecache folios. It might be kind of amusing to try to hoist that into the vfs/iomap at some point, but I think the next problem you'd run into is that fscrypt has similar requirements, since it's also a data transformation step. fsverity I think is less complicated because it only needs to read the pagecache contents at the very end to check it against the merkle tree. That, I think, is why this btrfs iomap port want to override submit_bio, right? So that it can allocate a separate set of folios, create a second bio around that, submit the second bio, decode what it read off the disk into the folios of the first bio, then "complete" the first bio so that iomap only has to update the pagecache state and doesn't have to know about the encoding magic? And then, having established that beachhead, porting btrfs fscrypt is a simple matter of adding more transformation steps to the ioend processing of the second bio (aka the one that actually calls the disk), right? And I think all that processing stuff is more or less already in place for the existing read path, so it should be trivial (ha!) to call it in an iomap context instead of straight from btrfs. iomap_folio_state notwithstanding, of course. Hmm. I'll have to give some thought to what would the ideal iomap data transformation pipeline look like? Though I already have a sneaking suspicion that will morph into "If I wanted to add {crypt,verity,compression} to xfs how would I do that?" -> "How would I design a pipeine to handle all three to avoid bouncing pages between workqueue threads like ext4 does?" -> "Oh gosh now I have a totally different design than any of the existing implementations." -> "Well, crumbs. :(" I'll start that by asking: Hey btrfs developers, what do you like and hate about the current way that btrfs handles fscrypt, compression, and fsverity? Assuming that you can set all three on a file, right? --D
On 10:47 07/10, Darrick J. Wong wrote: > On Sat, Oct 05, 2024 at 03:30:53AM +0100, Matthew Wilcox wrote: > > On Fri, Oct 04, 2024 at 04:04:33PM -0400, Goldwyn Rodrigues wrote: > > > Introduce read_inline() function hook for reading inline extents. This > > > is performed for filesystems such as btrfs which may compress the data > > > in the inline extents. > > Why don't you set iomap->inline_data to the uncompressed buffer, let > readahead copy it to the pagecache, and free it in ->iomap_end? This will increase the number of copies. BTRFS uncompresses directly into pagecache. Yes, this is an option but at the cost of efficiency. > > > This feels like an attempt to work around "iomap doesn't support > > compressed extents" by keeping the decompression in the filesystem, > > instead of extending iomap to support compressed extents itself. > > I'd certainly prefer iomap to support compressed extents, but maybe I'm > > in a minority here. > > I'm not an expert on fs compression, but I get the impression that most > filesystems handle reads by allocating some folios, reading off the disk > into those folios, and decompressing into the pagecache folios. It > might be kind of amusing to try to hoist that into the vfs/iomap at some > point, but I think the next problem you'd run into is that fscrypt has > similar requirements, since it's also a data transformation step. > fsverity I think is less complicated because it only needs to read the > pagecache contents at the very end to check it against the merkle tree. > > That, I think, is why this btrfs iomap port want to override submit_bio, > right? So that it can allocate a separate set of folios, create a > second bio around that, submit the second bio, decode what it read off > the disk into the folios of the first bio, then "complete" the first bio > so that iomap only has to update the pagecache state and doesn't have to > know about the encoding magic? Yes, but that is not the only reason. BTRFS also calculates and checks block checksums for data read during bio completions. > > And then, having established that beachhead, porting btrfs fscrypt is > a simple matter of adding more transformation steps to the ioend > processing of the second bio (aka the one that actually calls the disk), > right? And I think all that processing stuff is more or less already in > place for the existing read path, so it should be trivial (ha!) to > call it in an iomap context instead of straight from btrfs. > iomap_folio_state notwithstanding, of course. > > Hmm. I'll have to give some thought to what would the ideal iomap data > transformation pipeline look like? The order of transformation would make all the difference, and I am not sure if everyone involved can come to a conclusion that all transformations should be done in a particular decided order. FWIW, checksums are performed on what is read/written on disk. So for writes, compression happens before checksums. > > Though I already have a sneaking suspicion that will morph into "If I > wanted to add {crypt,verity,compression} to xfs how would I do that?" -> > "How would I design a pipeine to handle all three to avoid bouncing > pages between workqueue threads like ext4 does?" -> "Oh gosh now I have > a totally different design than any of the existing implementations." -> > "Well, crumbs. :(" > > I'll start that by asking: Hey btrfs developers, what do you like and > hate about the current way that btrfs handles fscrypt, compression, and > fsverity? Assuming that you can set all three on a file, right?
On Thu, Oct 10, 2024 at 02:10:25PM -0400, Goldwyn Rodrigues wrote: > On 10:47 07/10, Darrick J. Wong wrote: > > On Sat, Oct 05, 2024 at 03:30:53AM +0100, Matthew Wilcox wrote: > > > On Fri, Oct 04, 2024 at 04:04:33PM -0400, Goldwyn Rodrigues wrote: > > > > Introduce read_inline() function hook for reading inline extents. This > > > > is performed for filesystems such as btrfs which may compress the data > > > > in the inline extents. > > > > Why don't you set iomap->inline_data to the uncompressed buffer, let > > readahead copy it to the pagecache, and free it in ->iomap_end? > > This will increase the number of copies. BTRFS uncompresses directly > into pagecache. Yes, this is an option but at the cost of efficiency. Is that such a problem? The process of decompression means the data is already hot in cache, so there isn't a huge extra penalty for moving this inline data a second time... > > > This feels like an attempt to work around "iomap doesn't support > > > compressed extents" by keeping the decompression in the filesystem, > > > instead of extending iomap to support compressed extents itself. > > > I'd certainly prefer iomap to support compressed extents, but maybe I'm > > > in a minority here. > > > > I'm not an expert on fs compression, but I get the impression that most > > filesystems handle reads by allocating some folios, reading off the disk > > into those folios, and decompressing into the pagecache folios. It > > might be kind of amusing to try to hoist that into the vfs/iomap at some > > point, but I think the next problem you'd run into is that fscrypt has > > similar requirements, since it's also a data transformation step. > > fsverity I think is less complicated because it only needs to read the > > pagecache contents at the very end to check it against the merkle tree. > > > > That, I think, is why this btrfs iomap port want to override submit_bio, > > right? So that it can allocate a separate set of folios, create a > > second bio around that, submit the second bio, decode what it read off > > the disk into the folios of the first bio, then "complete" the first bio > > so that iomap only has to update the pagecache state and doesn't have to > > know about the encoding magic? > > Yes, but that is not the only reason. BTRFS also calculates and checks > block checksums for data read during bio completions. This is no different to doing fsverity checks at read BIO io completion. We should be using the same mechanism in iomap for fsverity IO completion verification as filesystems do for data checksum verification because, conceptually speaking, they are the same operation. > > And then, having established that beachhead, porting btrfs fscrypt is > > a simple matter of adding more transformation steps to the ioend > > processing of the second bio (aka the one that actually calls the disk), > > right? And I think all that processing stuff is more or less already in > > place for the existing read path, so it should be trivial (ha!) to > > call it in an iomap context instead of straight from btrfs. > > iomap_folio_state notwithstanding, of course. > > > > Hmm. I'll have to give some thought to what would the ideal iomap data > > transformation pipeline look like? > > The order of transformation would make all the difference, and I am not > sure if everyone involved can come to a conclusion that all > transformations should be done in a particular decided order. I think there is only one viable order of data transformations to/from disk. That's because.... > FWIW, checksums are performed on what is read/written on disk. So > for writes, compression happens before checksums. .... there is specific ordering needed. For writes, the ordering is: 1. pre-write data compression - requires data copy 2. pre-write data encryption - requires data copy 3. pre-write data checksums - data read only 4. write the data 5. post-write metadata updates We cannot usefully perform compression after encryption - random data doesn't compress - and the checksum must match what is written to disk, so it has to come after all other transformations have been done. For reads, the order is: 1. read the data 2. verify the data checksum 3. decrypt the data - requires data copy 4. decompress the data - requires data copy 5. place the plain text data in the page cache To do 1) we need memory buffers allocated, 2) runs directly out of them. If no other transformations are required, then we can read the data directly into the folios in the page cache. However, if we have to decrypt and/or decompress the data, then we need multiple sets of bounce buffers - one of the data being read and one of the decrypted data. The data can be decompressed directly into the page cache. If there is no compression, the decryption should be done directly into the page cache. If there is nor decryption or compression, then the read IO should be done directly into the page cache and the checksum verification done by reading out of the page cache. IOWs, each step of the data read path needs to have "buffer" that is a set of folios that may or may not be the final page cache location. The other consideration here is we may want to be able to cache these data transformation layers when we are doing rapid RMW operations. e.g. on encrypted data, a RMW will need to allocate two sets of bounce buffers - one for the read, another for the write. If the encrypted data is also hosted in the page cache (at some high offset beyond EOF) then we don't need to allocate the bounce buffer during write.... > > Though I already have a sneaking suspicion that will morph into "If I > > wanted to add {crypt,verity,compression} to xfs how would I do that?" -> > > "How would I design a pipeine to handle all three to avoid bouncing > > pages between workqueue threads like ext4 does?" -> "Oh gosh now I have > > a totally different design than any of the existing implementations." -> > > "Well, crumbs. :(" I've actually been thinking about how to do data CRCs, encryption and compression with XFS through iomap. I've even started writing a design doc, based on feedback from the first fsverity implementation attempt. Andrey and I are essentially working towards a "direct mapped remote xattr" model for fsverity merkle tree data. fsverity reads and writes are marshalled through the page cache at a filesystem determined offset beyond EOF (same model as ext4, et al), but the data is mapped into remote xattr blocks. This requires fixing the mistake of putting metadata headers into xattr remote blocks by moving the CRC to the remote xattr name structure such that remote xattrs only contain data again. The read/write ops that are passed to iomap for such IO are fsverity implementation specific that understand that the high offset is to be mapped directly to a filesystem block sized remote xattr extent and the IO is done directly on that block. The remote xattr block CRC can be retreived when it is mapped and validated on read IO completion by the iomap code before passing the data to fsverity. The reason for doing using xattrs in this way is that it provides a generic mechanism for storing multiple sets of optional data related and/or size-transformed data within a file and within the single page caceh address space the inode provides. For example, it allows XFS to implement native data checksums. For this, we need to store 2 x 32bit CRCs per filesystem block. i.e. we need the old CRC and the new CRC so we can write/log the CRC update before we write the data, and if we crash before the data hits the disk we still know what the original CRC was and so can validate that the filesystem block contains entirely either the old or the new data when it is next read. i.e. one of the two CRC values should always be valid. By using direct mapping into the page cache for CRCs, we don't need to continually refetch the checksum data. It gets read in once, and a large read will continue to pull CRC data from the cached folio as we walk through the data we read from disk. We can fetch the CRC data concurrently with the data itself, and IO completion processing doesn't signalled until both have been brought into cache. For writes, we can calculate the new CRCs sequentially and flush the cached CRC page but to the xattr when we reach the end of it. The data write bio that was built as we CRC'd the data can then be flushed once the xattr data has reached stable storage. There's a bit more latency to writeback operations here, but nothing is ever free... Compression is where using xattrs gets interesting - the xattrs can have a fixed "offset" they blong to, but can store variable sized data records for that offset. If we say we have a 64kB compression block size, we can store the compressed data for a 64k block entirely in a remote xattr even if compression fails (i.e. we can store the raw data, not the expanded "compressed" data). The remote xattr can store any amount of smaller data, and we map the compressed data directly into the page cache at a high offset. Then decompression can run on the high offset pages with the destination being some other page cache offset.... On the write side, compression can be done directly into the high offset page cache range for that 64kb offset range, then we can map that to a remote xattr block and write the xattr. The xattr naturally handles variable size blocks. If we overwrite the compressed data (e.g. a 4kB RMW cycle in a 64kB block), then we essentially overwrite the compressed data in the page cache at writeback, extending the folio set for that record if it grows. Then on flush of the compressed record, we atomically replace the remote xattr we already have so we are guaranteed that we'll see either the old or the new compressed data at that point in time. The new remote xattr block is CRC'd by the xattr code, so if we are using compression-only, we don't actually need separate on-disk data CRC xattrs... Encryption/decryption doesn't require alternate data storage, so that just requires reading the encrypted data into a high page cache offset. However, if we are compressing then encrypting, then after the encryption step we'd need to store it via the compression method, not the uncompressed method. Hence there are some implementation details needed to be worked through here. ----- Overall, I'm looking an iomap mechanism that stages every part of the transformation stack in the page cache itself. We have address space to burn because we don't support file sizes larger than 8EiB. That leaves a full 8EiB of address space available for hosting non-user-visible ephemeral IO path data states. How to break up that address space for different data transformation uses is an open question. Files with transformed data would be limited to the size of the address space reserved by iomap for transformed data, so we want it to be large. XFS can't use more than 2^32 -extents- for xattrs on an inode at this point in time. That means >4 billion alternate data records can be stored, not that 256TB is the maximum offset that can be supported. Hence I'm tending towards at least 1PB per alternate address range, which would limit encrypted and/or compressed files to this size. That leaves over 8000 alternate address ranges iomap can support, but I suspect that we'll want fewer, larger ranges in preference to more, smaller ranges.... > > I'll start that by asking: Hey btrfs developers, what do you like and > > hate about the current way that btrfs handles fscrypt, compression, and > > fsverity? Assuming that you can set all three on a file, right? I'm definitely assuming that all 3 can be set at once, as well as low level filesystem data CRCs underneath all the layered "VFS" stuff. However, what layers we actually need at any point in time can be malleable. For XFS, compression would naturally be CRC'd by the xattr storage so it wouldn't need that layer at all. If fsverity is enabled and fs admin tooling understands fsverity, then we probably don't need fs checksums for fsverity files. so there definitely needs to be some flexibility in the stack to allow filesystems to elide transformation layers that are redundant... -Dave.
Hi Dave, On 2024/10/11 08:43, Dave Chinner wrote: > On Thu, Oct 10, 2024 at 02:10:25PM -0400, Goldwyn Rodrigues wrote: ... > > .... there is specific ordering needed. > > For writes, the ordering is: > > 1. pre-write data compression - requires data copy > 2. pre-write data encryption - requires data copy > 3. pre-write data checksums - data read only > 4. write the data > 5. post-write metadata updates > > We cannot usefully perform compression after encryption - > random data doesn't compress - and the checksum must match what is > written to disk, so it has to come after all other transformations > have been done. > > For reads, the order is: > > 1. read the data > 2. verify the data checksum > 3. decrypt the data - requires data copy > 4. decompress the data - requires data copy > 5. place the plain text data in the page cache Just random stuffs for for reference, currently fsverity makes markle tree for the plain text, but from the on-disk data security/authentication and integrity perspective, I guess the order you mentioned sounds more saner to me, or: 1. read the data 2. pre-verify the encoded data checksum (optional, dm-verity likewise way) 3. decrypt the data - requires data copy 4. decompress the data - requires data copy 5. post-verify the decoded (plain) checksum (optional) 6. place the plain text data in the page cache 2,5 may apply to different use cases though. > ... > > Compression is where using xattrs gets interesting - the xattrs can > have a fixed "offset" they blong to, but can store variable sized > data records for that offset. > > If we say we have a 64kB compression block size, we can store the > compressed data for a 64k block entirely in a remote xattr even if > compression fails (i.e. we can store the raw data, not the expanded > "compressed" data). The remote xattr can store any amount of smaller > data, and we map the compressed data directly into the page cache at > a high offset. Then decompression can run on the high offset pages > with the destination being some other page cache offset.... but compressed data itself can also be multiple reference (reflink likewise), so currently EROFS uses a seperate pseudo inode if it decides with physical addresses as indexes. > > On the write side, compression can be done directly into the high > offset page cache range for that 64kb offset range, then we can > map that to a remote xattr block and write the xattr. The xattr > naturally handles variable size blocks. Also different from plain text, each compression fses may keep different encoded data forms (e.g. fses could add headers or trailers to the on-disk compressed data or add more informations to extent metadata) for their own needs. So unlike the current unique plain text process, an unique encoder/decoder may not sound quite flexible though. Thanks, Gao Xiang
[FYI, your email got classified as spam by gmail...] On Fri, Oct 11, 2024 at 11:28:42AM +0800, Gao Xiang wrote: > Hi Dave, > > On 2024/10/11 08:43, Dave Chinner wrote: > > On Thu, Oct 10, 2024 at 02:10:25PM -0400, Goldwyn Rodrigues wrote: > > ... > > > > > .... there is specific ordering needed. > > > > For writes, the ordering is: > > > > 1. pre-write data compression - requires data copy > > 2. pre-write data encryption - requires data copy > > 3. pre-write data checksums - data read only > > 4. write the data > > 5. post-write metadata updates > > > > We cannot usefully perform compression after encryption - > > random data doesn't compress - and the checksum must match what is > > written to disk, so it has to come after all other transformations > > have been done. > > > > For reads, the order is: > > > > 1. read the data > > 2. verify the data checksum > > 3. decrypt the data - requires data copy > > 4. decompress the data - requires data copy > > 5. place the plain text data in the page cache > > Just random stuffs for for reference, currently fsverity makes > markle tree for the plain text, Well, that is specifically an existing implementation detail - the fsverity core does not care what data is asked to measure as long as it is the same data that it is asked to verify. Hence a filesystem could ask fsverity to measure compressed, encrypted data, and as long as the filesystem also asks fsverity to measure the same compressed, encrypted data as it is read from disk it will work as expected. We could do this quite easily - hand the compressed data record to fsverity one fsblock sized chunk at a time, and treat the empty regions between the end of the compressed record and the offset of the start of the next compressed record as a hole.... So, yeah, I think that fsverity can be placed at the at the "verify data on disk" layer successfully rather than at the "verify plain text" layer without actually impacting on it's functionality. .... > > Compression is where using xattrs gets interesting - the xattrs can > > have a fixed "offset" they blong to, but can store variable sized > > data records for that offset. > > > > If we say we have a 64kB compression block size, we can store the > > compressed data for a 64k block entirely in a remote xattr even if > > compression fails (i.e. we can store the raw data, not the expanded > > "compressed" data). The remote xattr can store any amount of smaller > > data, and we map the compressed data directly into the page cache at > > a high offset. Then decompression can run on the high offset pages > > with the destination being some other page cache offset.... > > but compressed data itself can also be multiple reference (reflink > likewise), so currently EROFS uses a seperate pseudo inode if it > decides with physical addresses as indexes. Sure, but handling shared data extents and breaking of shared mappings on write is not an iomap/page cache problem - that's a problem the filesystem block mapping operations that iomap calls need to handle. EROFS uses a separate pseudo inode so taht it can share page cache as well as shared blocks on disk. I don't think that compression changes that at all - the page cache contents for all those blocks are still going to be identical... As for the case of shared compressed data extents in XFS, I think that shared status just needs a shared bit to added to the remote xattr extent record header. Nothing else will really have to change, because xattr record overwrites are naturally copy-on-write. Hence updating a record will always break sharing, and the "shared bit" simply propagates into the block freeing operation to indicate a refcount update for the blocks being freed is needed. I don't see supporting FICLONE on compressed inodes as a major issue. > > On the write side, compression can be done directly into the high > > offset page cache range for that 64kb offset range, then we can > > map that to a remote xattr block and write the xattr. The xattr > > naturally handles variable size blocks. > > Also different from plain text, each compression fses may keep > different encoded data forms (e.g. fses could add headers or > trailers to the on-disk compressed data or add more informations > to extent metadata) for their own needs.i Sure, but that's something that the filesystem can add when encoding the data into the page cache. iomap treats the contents of the page caceh as entirely opaque - how "transformed" data is encoded in the destination folios is completely up to the filesystem doing the transformation. All iomap needs to care about is the offset and length of the opaque transformed data the filesystem needs to reside in the cache to perform the transformation. i.e. The example I gave above for XFS compression doesn't need metadata in the page cache data because it is held in an externally indexed xattr btree record. That's an XFS compression implementation detail, not an iomap concern. -Dave.
On 2024/10/11 12:52, Dave Chinner wrote: > [FYI, your email got classified as spam by gmail...] (I know.. yet that is the only permitted way to send email at work..) > > On Fri, Oct 11, 2024 at 11:28:42AM +0800, Gao Xiang wrote: >> Hi Dave, >> >> On 2024/10/11 08:43, Dave Chinner wrote: >>> On Thu, Oct 10, 2024 at 02:10:25PM -0400, Goldwyn Rodrigues wrote: >> >> ... >> >>> >>> .... there is specific ordering needed. >>> >>> For writes, the ordering is: >>> >>> 1. pre-write data compression - requires data copy >>> 2. pre-write data encryption - requires data copy >>> 3. pre-write data checksums - data read only >>> 4. write the data >>> 5. post-write metadata updates >>> >>> We cannot usefully perform compression after encryption - >>> random data doesn't compress - and the checksum must match what is >>> written to disk, so it has to come after all other transformations >>> have been done. >>> >>> For reads, the order is: >>> >>> 1. read the data >>> 2. verify the data checksum >>> 3. decrypt the data - requires data copy >>> 4. decompress the data - requires data copy >>> 5. place the plain text data in the page cache >> >> Just random stuffs for for reference, currently fsverity makes >> markle tree for the plain text, > > Well, that is specifically an existing implementation detail - > the fsverity core does not care what data is asked to measure as long > as it is the same data that it is asked to verify. > > Hence a filesystem could ask fsverity to measure compressed, > encrypted data, and as long as the filesystem also asks fsverity to > measure the same compressed, encrypted data as it is read from disk > it will work as expected. > > We could do this quite easily - hand the compressed data record > to fsverity one fsblock sized chunk at a time, and treat the empty > regions between the end of the compressed record and the offset > of the start of the next compressed record as a hole.... .. honestly I'm not quite sure that is an implementation detail, for example, currently userspace can get the root hash digest of files to check the identical files, such as the same data A: A + LZ4 = A1 A + DEFLATE = A2 A + Zstd = A3 All three files will have the same root digest for the current fsverity use cases, but if merkle trees are applied to transformed data, that will be difference and might not meet some users' use cases anyway. > > So, yeah, I think that fsverity can be placed at the at the "verify > data on disk" layer successfully rather than at the "verify plain > text" layer without actually impacting on it's functionality. > > .... >>> Compression is where using xattrs gets interesting - the xattrs can >>> have a fixed "offset" they blong to, but can store variable sized >>> data records for that offset. >>> >>> If we say we have a 64kB compression block size, we can store the >>> compressed data for a 64k block entirely in a remote xattr even if >>> compression fails (i.e. we can store the raw data, not the expanded >>> "compressed" data). The remote xattr can store any amount of smaller >>> data, and we map the compressed data directly into the page cache at >>> a high offset. Then decompression can run on the high offset pages >>> with the destination being some other page cache offset.... >> >> but compressed data itself can also be multiple reference (reflink >> likewise), so currently EROFS uses a seperate pseudo inode if it >> decides with physical addresses as indexes. > > Sure, but handling shared data extents and breaking of shared > mappings on write is not an iomap/page cache problem - that's a > problem the filesystem block mapping operations that iomap calls > need to handle. > > EROFS uses a separate pseudo inode so taht it can share page cache > as well as shared blocks on disk. I don't think that compression > changes that at all - the page cache contents for all those blocks > are still going to be identical... > > As for the case of shared compressed data extents in XFS, I think > that shared status just needs a shared bit to added to the remote > xattr extent record header. Nothing else will really have to change, > because xattr record overwrites are naturally copy-on-write. Hence > updating a record will always break sharing, and the "shared bit" > simply propagates into the block freeing operation to indicate a > refcount update for the blocks being freed is needed. I don't see > supporting FICLONE on compressed inodes as a major issue. Yes, I agree for XFS on-disk format it's quite easy. My comment related to a minor runtime point: "compressed data directly into the page cache at a high offset". That is if a separate pseudo inode is used to contain cached compressed data, it will take the only one copy and one I/O for shared compressed data if cache decompression is used.. Anyway, that is XFS's proposal, so that was my minor comment through. > >>> On the write side, compression can be done directly into the high >>> offset page cache range for that 64kb offset range, then we can >>> map that to a remote xattr block and write the xattr. The xattr >>> naturally handles variable size blocks. >> >> Also different from plain text, each compression fses may keep >> different encoded data forms (e.g. fses could add headers or >> trailers to the on-disk compressed data or add more informations >> to extent metadata) for their own needs.i > > Sure, but that's something that the filesystem can add when encoding > the data into the page cache. iomap treats the contents of the page > caceh as entirely opaque - how "transformed" data is encoded in the > destination folios is completely up to the filesystem doing the > transformation. All iomap needs to care about is the offset and > length of the opaque transformed data the filesystem needs to reside > in the cache to perform the transformation. > > i.e. The example I gave above for XFS compression doesn't need > metadata in the page cache data because it is held in an externally > indexed xattr btree record. That's an XFS compression implementation > detail, not an iomap concern. Got it. Thanks, Gao Xiang > > -Dave.
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index 4c734899a8e5..ef805730125a 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -359,6 +359,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter, const struct iomap *iomap = iomap_iter_srcmap(iter); size_t size = i_size_read(iter->inode) - iomap->offset; size_t offset = offset_in_folio(folio, iomap->offset); + int ret = 0; if (folio_test_uptodate(folio)) return 0; @@ -368,9 +369,14 @@ static int iomap_read_inline_data(const struct iomap_iter *iter, if (offset > 0) ifs_alloc(iter->inode, folio, iter->flags); - folio_fill_tail(folio, offset, iomap->inline_data, size); - iomap_set_range_uptodate(folio, offset, folio_size(folio) - offset); - return 0; + if (iomap->folio_ops && iomap->folio_ops->read_inline) + ret = iomap->folio_ops->read_inline(iomap, folio); + else + folio_fill_tail(folio, offset, iomap->inline_data, size); + + if (!ret) + iomap_set_range_uptodate(folio, offset, folio_size(folio) - offset); + return ret; } static inline bool iomap_block_needs_zeroing(const struct iomap_iter *iter, diff --git a/include/linux/iomap.h b/include/linux/iomap.h index a5cf00a01f23..82dabc0369cd 100644 --- a/include/linux/iomap.h +++ b/include/linux/iomap.h @@ -163,6 +163,13 @@ struct iomap_folio_ops { * locked by the iomap code. */ bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap); + + /* + * Custom read_inline function for filesystem such as btrfs + * that may store data in compressed form. + */ + + int (*read_inline)(const struct iomap *iomap, struct folio *folio); }; /*