[06/12] iomap: Introduce read_inline() function hook

Message ID	8147ae0a45b9851eacad4e8f5a71b7997c23bdd0.1728071257.git.rgoldwyn@suse.com (mailing list archive)
State	New
Headers	show Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.223.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3C31B1DDC32; Fri, 4 Oct 2024 20:05:34 +0000 (UTC) From: Goldwyn Rodrigues <rgoldwyn@suse.de> To: linux-kernel@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org, Goldwyn Rodrigues <rgoldwyn@suse.com> Subject: [PATCH 06/12] iomap: Introduce read_inline() function hook Date: Fri, 4 Oct 2024 16:04:33 -0400 Message-ID: <8147ae0a45b9851eacad4e8f5a71b7997c23bdd0.1728071257.git.rgoldwyn@suse.com> In-Reply-To: <cover.1728071257.git.rgoldwyn@suse.com> References: <cover.1728071257.git.rgoldwyn@suse.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit default: False [-3.80 / 50.00]; BAYES_HAM(-3.00)[100.00%]; NEURAL_HAM_LONG(-1.00)[-1.000]; R_MISSING_CHARSET(0.50)[]; NEURAL_HAM_SHORT(-0.20)[-1.000]; MIME_GOOD(-0.10)[text/plain]; TO_DN_SOME(0.00)[]; MIME_TRACE(0.00)[0:+]; RCVD_VIA_SMTP_AUTH(0.00)[]; ARC_NA(0.00)[]; DKIM_SIGNED(0.00)[suse.de:s=susede2_rsa,suse.de:s=susede2_ed25519]; FUZZY_BLOCKED(0.00)[rspamd.com]; TO_MATCH_ENVRCPT_ALL(0.00)[]; FROM_HAS_DN(0.00)[]; RCPT_COUNT_THREE(0.00)[3]; FROM_EQ_ENVFROM(0.00)[]; DBL_BLOCKED_OPENRESOLVER(0.00)[suse.com:email,suse.com:mid]; RCVD_COUNT_TWO(0.00)[2]; RCVD_TLS_ALL(0.00)[]
Series	btrfs reads through iomap \| expand [00/12] btrfs reads through iomap [01/12] iomap: check if folio size is equal to FS block size [02/12] iomap: Introduce iomap_read_folio_ops [03/12] iomap: add bioset in iomap_read_folio_ops for filesystems to use own bioset [04/12] iomap: include iomap_read_end_io() in header [05/12] iomap: Introduce IOMAP_ENCODED [06/12] iomap: Introduce read_inline() function hook [07/12] btrfs: btrfs_em_to_iomap() to convert em to iomap [08/12] btrfs: iomap_begin() for buffered reads [09/12] btrfs: define btrfs_iomap_read_folio_ops [10/12] btrfs: define btrfs_iomap_folio_ops [11/12] btrfs: add read_inline for folio operations for read() calls [12/12] btrfs: switch to iomap for buffered reads

Goldwyn Rodrigues Oct. 4, 2024, 8:04 p.m. UTC

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Introduce read_inline() function hook for reading inline extents. This
is performed for filesystems such as btrfs which may compress the data
in the inline extents.

This is added in struct iomap_folio_ops, since folio is available at
this point.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/iomap/buffered-io.c | 12 +++++++++---
 include/linux/iomap.h  |  7 +++++++
 2 files changed, 16 insertions(+), 3 deletions(-)

Matthew Wilcox Oct. 5, 2024, 2:30 a.m. UTC | #1

On Fri, Oct 04, 2024 at 04:04:33PM -0400, Goldwyn Rodrigues wrote:
> Introduce read_inline() function hook for reading inline extents. This
> is performed for filesystems such as btrfs which may compress the data
> in the inline extents.

This feels like an attempt to work around "iomap doesn't support
compressed extents" by keeping the decompression in the filesystem,
instead of extending iomap to support compressed extents itself.
I'd certainly prefer iomap to support compressed extents, but maybe I'm
in a minority here.

Darrick J. Wong Oct. 7, 2024, 5:47 p.m. UTC | #2

On Sat, Oct 05, 2024 at 03:30:53AM +0100, Matthew Wilcox wrote:
> On Fri, Oct 04, 2024 at 04:04:33PM -0400, Goldwyn Rodrigues wrote:
> > Introduce read_inline() function hook for reading inline extents. This
> > is performed for filesystems such as btrfs which may compress the data
> > in the inline extents.

Why don't you set iomap->inline_data to the uncompressed buffer, let
readahead copy it to the pagecache, and free it in ->iomap_end?

> This feels like an attempt to work around "iomap doesn't support
> compressed extents" by keeping the decompression in the filesystem,
> instead of extending iomap to support compressed extents itself.
> I'd certainly prefer iomap to support compressed extents, but maybe I'm
> in a minority here.

I'm not an expert on fs compression, but I get the impression that most
filesystems handle reads by allocating some folios, reading off the disk
into those folios, and decompressing into the pagecache folios.  It
might be kind of amusing to try to hoist that into the vfs/iomap at some
point, but I think the next problem you'd run into is that fscrypt has
similar requirements, since it's also a data transformation step.
fsverity I think is less complicated because it only needs to read the
pagecache contents at the very end to check it against the merkle tree.

That, I think, is why this btrfs iomap port want to override submit_bio,
right?  So that it can allocate a separate set of folios, create a
second bio around that, submit the second bio, decode what it read off
the disk into the folios of the first bio, then "complete" the first bio
so that iomap only has to update the pagecache state and doesn't have to
know about the encoding magic?

And then, having established that beachhead, porting btrfs fscrypt is
a simple matter of adding more transformation steps to the ioend
processing of the second bio (aka the one that actually calls the disk),
right?  And I think all that processing stuff is more or less already in
place for the existing read path, so it should be trivial (ha!) to
call it in an iomap context instead of straight from btrfs.
iomap_folio_state notwithstanding, of course.

Hmm.  I'll have to give some thought to what would the ideal iomap data
transformation pipeline look like?

Though I already have a sneaking suspicion that will morph into "If I
wanted to add {crypt,verity,compression} to xfs how would I do that?" ->
"How would I design a pipeine to handle all three to avoid bouncing
pages between workqueue threads like ext4 does?" -> "Oh gosh now I have
a totally different design than any of the existing implementations." ->
"Well, crumbs. :("

I'll start that by asking: Hey btrfs developers, what do you like and
hate about the current way that btrfs handles fscrypt, compression, and
fsverity?  Assuming that you can set all three on a file, right?

--D

Goldwyn Rodrigues Oct. 10, 2024, 6:10 p.m. UTC | #3

On 10:47 07/10, Darrick J. Wong wrote:
> On Sat, Oct 05, 2024 at 03:30:53AM +0100, Matthew Wilcox wrote:
> > On Fri, Oct 04, 2024 at 04:04:33PM -0400, Goldwyn Rodrigues wrote:
> > > Introduce read_inline() function hook for reading inline extents. This
> > > is performed for filesystems such as btrfs which may compress the data
> > > in the inline extents.
> 
> Why don't you set iomap->inline_data to the uncompressed buffer, let
> readahead copy it to the pagecache, and free it in ->iomap_end?

This will increase the number of copies. BTRFS uncompresses directly
into pagecache. Yes, this is an option but at the cost of efficiency.

> 
> > This feels like an attempt to work around "iomap doesn't support
> > compressed extents" by keeping the decompression in the filesystem,
> > instead of extending iomap to support compressed extents itself.
> > I'd certainly prefer iomap to support compressed extents, but maybe I'm
> > in a minority here.
> 
> I'm not an expert on fs compression, but I get the impression that most
> filesystems handle reads by allocating some folios, reading off the disk
> into those folios, and decompressing into the pagecache folios.  It
> might be kind of amusing to try to hoist that into the vfs/iomap at some
> point, but I think the next problem you'd run into is that fscrypt has
> similar requirements, since it's also a data transformation step.
> fsverity I think is less complicated because it only needs to read the
> pagecache contents at the very end to check it against the merkle tree.
> 
> That, I think, is why this btrfs iomap port want to override submit_bio,
> right?  So that it can allocate a separate set of folios, create a
> second bio around that, submit the second bio, decode what it read off
> the disk into the folios of the first bio, then "complete" the first bio
> so that iomap only has to update the pagecache state and doesn't have to
> know about the encoding magic?

Yes, but that is not the only reason. BTRFS also calculates and checks
block checksums for data read during bio completions.

> 
> And then, having established that beachhead, porting btrfs fscrypt is
> a simple matter of adding more transformation steps to the ioend
> processing of the second bio (aka the one that actually calls the disk),
> right?  And I think all that processing stuff is more or less already in
> place for the existing read path, so it should be trivial (ha!) to
> call it in an iomap context instead of straight from btrfs.
> iomap_folio_state notwithstanding, of course.
> 
> Hmm.  I'll have to give some thought to what would the ideal iomap data
> transformation pipeline look like?

The order of transformation would make all the difference, and I am not
sure if everyone involved can come to a conclusion that all
transformations should be done in a particular decided order.

FWIW, checksums are performed on what is read/written on disk. So
for writes, compression happens before checksums.

> 
> Though I already have a sneaking suspicion that will morph into "If I
> wanted to add {crypt,verity,compression} to xfs how would I do that?" ->
> "How would I design a pipeine to handle all three to avoid bouncing
> pages between workqueue threads like ext4 does?" -> "Oh gosh now I have
> a totally different design than any of the existing implementations." ->
> "Well, crumbs. :("
> 
> I'll start that by asking: Hey btrfs developers, what do you like and
> hate about the current way that btrfs handles fscrypt, compression, and
> fsverity?  Assuming that you can set all three on a file, right?

Dave Chinner Oct. 11, 2024, 12:43 a.m. UTC | #4

On Thu, Oct 10, 2024 at 02:10:25PM -0400, Goldwyn Rodrigues wrote:
> On 10:47 07/10, Darrick J. Wong wrote:
> > On Sat, Oct 05, 2024 at 03:30:53AM +0100, Matthew Wilcox wrote:
> > > On Fri, Oct 04, 2024 at 04:04:33PM -0400, Goldwyn Rodrigues wrote:
> > > > Introduce read_inline() function hook for reading inline extents. This
> > > > is performed for filesystems such as btrfs which may compress the data
> > > > in the inline extents.
> > 
> > Why don't you set iomap->inline_data to the uncompressed buffer, let
> > readahead copy it to the pagecache, and free it in ->iomap_end?
> 
> This will increase the number of copies. BTRFS uncompresses directly
> into pagecache. Yes, this is an option but at the cost of efficiency.

Is that such a problem? The process of decompression means the data
is already hot in cache, so there isn't a huge extra penalty for
moving this inline data a second time...

> > > This feels like an attempt to work around "iomap doesn't support
> > > compressed extents" by keeping the decompression in the filesystem,
> > > instead of extending iomap to support compressed extents itself.
> > > I'd certainly prefer iomap to support compressed extents, but maybe I'm
> > > in a minority here.
> > 
> > I'm not an expert on fs compression, but I get the impression that most
> > filesystems handle reads by allocating some folios, reading off the disk
> > into those folios, and decompressing into the pagecache folios.  It
> > might be kind of amusing to try to hoist that into the vfs/iomap at some
> > point, but I think the next problem you'd run into is that fscrypt has
> > similar requirements, since it's also a data transformation step.
> > fsverity I think is less complicated because it only needs to read the
> > pagecache contents at the very end to check it against the merkle tree.
> > 
> > That, I think, is why this btrfs iomap port want to override submit_bio,
> > right?  So that it can allocate a separate set of folios, create a
> > second bio around that, submit the second bio, decode what it read off
> > the disk into the folios of the first bio, then "complete" the first bio
> > so that iomap only has to update the pagecache state and doesn't have to
> > know about the encoding magic?
> 
> Yes, but that is not the only reason. BTRFS also calculates and checks
> block checksums for data read during bio completions.

This is no different to doing fsverity checks at read BIO io
completion. We should be using the same mechanism in iomap for
fsverity IO completion verification as filesystems do for data
checksum verification because, conceptually speaking, they are the
same operation.

> > And then, having established that beachhead, porting btrfs fscrypt is
> > a simple matter of adding more transformation steps to the ioend
> > processing of the second bio (aka the one that actually calls the disk),
> > right?  And I think all that processing stuff is more or less already in
> > place for the existing read path, so it should be trivial (ha!) to
> > call it in an iomap context instead of straight from btrfs.
> > iomap_folio_state notwithstanding, of course.
> > 
> > Hmm.  I'll have to give some thought to what would the ideal iomap data
> > transformation pipeline look like?
> 
> The order of transformation would make all the difference, and I am not
> sure if everyone involved can come to a conclusion that all
> transformations should be done in a particular decided order.

I think there is only one viable order of data transformations
to/from disk. That's because....

> FWIW, checksums are performed on what is read/written on disk. So
> for writes, compression happens before checksums.

.... there is specific ordering needed.

For writes, the ordering is:

	1. pre-write data compression - requires data copy
	2. pre-write data encryption - requires data copy
	3. pre-write data checksums - data read only
	4. write the data
	5. post-write metadata updates

We cannot usefully perform compression after encryption -
random data doesn't compress - and the checksum must match what is
written to disk, so it has to come after all other transformations
have been done.

For reads, the order is:

	1. read the data
	2. verify the data checksum
	3. decrypt the data - requires data copy
	4. decompress the data - requires data copy
	5. place the plain text data in the page cache

To do 1) we need memory buffers allocated, 2) runs directly out of
them. If no other transformations are required, then we can read the
data directly into the folios in the page cache.

However, if we have to decrypt and/or decompress the data, then
we need multiple sets of bounce buffers - one of the data being
read and one of the decrypted data. The data can be decompressed
directly into the page cache.

If there is no compression, the decryption should be done directly
into the page cache.

If there is nor decryption or compression, then the read IO should
be done directly into the page cache and the checksum verification
done by reading out of the page cache.

IOWs, each step of the data read path needs to have "buffer" that
is a set of folios that may or may not be the final page cache
location.

The other consideration here is we may want to be able to cache
these data transformation layers when we are doing rapid RMW
operations. e.g. on encrypted data, a RMW will need to allocate
two sets of bounce buffers - one for the read, another for the
write. If the encrypted data is also hosted in the page cache (at
some high offset beyond EOF) then we don't need to allocate the
bounce buffer during write....

> > Though I already have a sneaking suspicion that will morph into "If I
> > wanted to add {crypt,verity,compression} to xfs how would I do that?" ->
> > "How would I design a pipeine to handle all three to avoid bouncing
> > pages between workqueue threads like ext4 does?" -> "Oh gosh now I have
> > a totally different design than any of the existing implementations." ->
> > "Well, crumbs. :("

I've actually been thinking about how to do data CRCs, encryption
and compression with XFS through iomap. I've even started writing a
design doc, based on feedback from the first fsverity implementation
attempt.

Andrey and I are essentially working towards a "direct mapped remote
xattr" model for fsverity merkle tree data. fsverity reads and
writes are marshalled through the page cache at a filesystem
determined offset beyond EOF (same model as ext4, et al), but the
data is mapped into remote xattr blocks. This requires fixing the
mistake of putting metadata headers into xattr remote blocks by
moving the CRC to the remote xattr name structure such that remote
xattrs only contain data again.

The read/write ops that are passed to iomap for such IO are fsverity
implementation specific that understand that the high offset is to
be mapped directly to a filesystem block sized remote xattr extent
and the IO is done directly on that block. The remote xattr block
CRC can be retreived when it is mapped and validated on read IO
completion by the iomap code before passing the data to fsverity.

The reason for doing using xattrs in this way is that it provides a
generic mechanism for storing multiple sets of optional data related
and/or size-transformed data within a file and within the single
page caceh address space the inode provides.

For example, it allows XFS to implement native data checksums. For
this, we need to store 2 x 32bit CRCs per filesystem block. i.e. we
need the old CRC and the new CRC so we can write/log the CRC update
before we write the data, and if we crash before the data hits the
disk we still know what the original CRC was and so can validate
that the filesystem block contains entirely either the old or the
new data when it is next read. i.e. one of the two CRC values should
always be valid.

By using direct mapping into the page cache for CRCs, we don't need
to continually refetch the checksum data. It gets read in once, and
a large read will continue to pull CRC data from the cached folio as
we walk through the data we read from disk.  We can fetch the CRC
data concurrently with the data itself, and IO completion processing
doesn't signalled until both have been brought into cache.

For writes, we can calculate the new CRCs sequentially and flush the
cached CRC page but to the xattr when we reach the end of it. The
data write bio that was built as we CRC'd the data can then be
flushed once the xattr data has reached stable storage. There's a
bit more latency to writeback operations here, but nothing is ever
free...

Compression is where using xattrs gets interesting - the xattrs can
have a fixed "offset" they blong to, but can store variable sized
data records for that offset.

If we say we have a 64kB compression block size, we can store the
compressed data for a 64k block entirely in a remote xattr even if
compression fails (i.e. we can store the raw data, not the expanded
"compressed" data). The remote xattr can store any amount of smaller
data, and we map the compressed data directly into the page cache at
a high offset. Then decompression can run on the high offset pages
with the destination being some other page cache offset....

On the write side, compression can be done directly into the high
offset page cache range for that 64kb offset range, then we can
map that to a remote xattr block and write the xattr. The xattr
naturally handles variable size blocks.

If we overwrite the compressed data (e.g. a 4kB RMW cycle in a 64kB
block), then we essentially overwrite the compressed data in the
page cache at writeback, extending the folio set for that record
if it grows. Then on flush of the compressed record, we atomically
replace the remote xattr we already have so we are guaranteed that
we'll see either the old or the new compressed data at that point in
time. The new remote xattr block is CRC'd by the xattr code, so if
we are using compression-only, we don't actually need separate
on-disk data CRC xattrs...

Encryption/decryption doesn't require alternate data storage, so
that just requires reading the encrypted data into a high page cache
offset. However, if we are compressing then encrypting, then after
the encryption step we'd need to store it via the compression
method, not the uncompressed method. Hence there are some
implementation details needed to be worked through here.

-----

Overall, I'm looking an iomap mechanism that stages every part of
the transformation stack in the page cache itself. We have address
space to burn because we don't support file sizes larger than 8EiB.
That leaves a full 8EiB of address space available for hosting
non-user-visible ephemeral IO path data states.

How to break up that address space for different data transformation
uses is an open question. Files with transformed data would be
limited to the size of the address space reserved by iomap for
transformed data, so we want it to be large.

XFS can't use more than 2^32 -extents- for xattrs on an inode at
this point in time. That means >4 billion alternate data records can
be stored, not that 256TB is the maximum offset that can be
supported.

Hence I'm tending towards at least 1PB per alternate address
range, which would limit encrypted and/or compressed files to this
size. That leaves over 8000 alternate address ranges iomap can
support, but I suspect that we'll want fewer, larger ranges in
preference to more, smaller ranges....

> > I'll start that by asking: Hey btrfs developers, what do you like and
> > hate about the current way that btrfs handles fscrypt, compression, and
> > fsverity?  Assuming that you can set all three on a file, right?

I'm definitely assuming that all 3 can be set at once, as well as
low level filesystem data CRCs underneath all the layered "VFS"
stuff.

However, what layers we actually need at any point in time can be
malleable. For XFS, compression would naturally be CRC'd by the
xattr storage so it wouldn't need that layer at all. If fsverity is
enabled and fs admin tooling understands fsverity, then we probably
don't need fs checksums for fsverity files.

so there definitely needs to be some flexibility in the stack to
allow filesystems to elide transformation layers that are
redundant...

-Dave.

Gao Xiang Oct. 11, 2024, 3:28 a.m. UTC | #5

Hi Dave,

On 2024/10/11 08:43, Dave Chinner wrote:
> On Thu, Oct 10, 2024 at 02:10:25PM -0400, Goldwyn Rodrigues wrote:

...

> 
> .... there is specific ordering needed.
> 
> For writes, the ordering is:
> 
> 	1. pre-write data compression - requires data copy
> 	2. pre-write data encryption - requires data copy
> 	3. pre-write data checksums - data read only
> 	4. write the data
> 	5. post-write metadata updates
> 
> We cannot usefully perform compression after encryption -
> random data doesn't compress - and the checksum must match what is
> written to disk, so it has to come after all other transformations
> have been done.
> 
> For reads, the order is:
> 
> 	1. read the data
> 	2. verify the data checksum
> 	3. decrypt the data - requires data copy
> 	4. decompress the data - requires data copy
> 	5. place the plain text data in the page cache

Just random stuffs for for reference, currently fsverity makes
markle tree for the plain text, but from the on-disk data
security/authentication and integrity perspective, I guess the
order you mentioned sounds more saner to me, or:

  	1. read the data
  	2. pre-verify the encoded data checksum (optional,
                                        dm-verity likewise way)
  	3. decrypt the data - requires data copy
  	4. decompress the data - requires data copy
	5. post-verify the decoded (plain) checksum (optional)
  	6. place the plain text data in the page cache

2,5 may apply to different use cases though.

> 

...

> 
> Compression is where using xattrs gets interesting - the xattrs can
> have a fixed "offset" they blong to, but can store variable sized
> data records for that offset.
> 
> If we say we have a 64kB compression block size, we can store the
> compressed data for a 64k block entirely in a remote xattr even if
> compression fails (i.e. we can store the raw data, not the expanded
> "compressed" data). The remote xattr can store any amount of smaller
> data, and we map the compressed data directly into the page cache at
> a high offset. Then decompression can run on the high offset pages
> with the destination being some other page cache offset....

but compressed data itself can also be multiple reference (reflink
likewise), so currently EROFS uses a seperate pseudo inode if it
decides with physical addresses as indexes.

> 
> On the write side, compression can be done directly into the high
> offset page cache range for that 64kb offset range, then we can
> map that to a remote xattr block and write the xattr. The xattr
> naturally handles variable size blocks.

Also different from plain text, each compression fses may keep
different encoded data forms (e.g. fses could add headers or
trailers to the on-disk compressed data or add more informations
to extent metadata) for their own needs.  So unlike the current
unique plain text process, an unique encoder/decoder may not sound
quite flexible though.

Thanks,
Gao Xiang

Dave Chinner Oct. 11, 2024, 4:52 a.m. UTC | #6

[FYI, your email got classified as spam by gmail...]

On Fri, Oct 11, 2024 at 11:28:42AM +0800, Gao Xiang wrote:
> Hi Dave,
> 
> On 2024/10/11 08:43, Dave Chinner wrote:
> > On Thu, Oct 10, 2024 at 02:10:25PM -0400, Goldwyn Rodrigues wrote:
> 
> ...
> 
> > 
> > .... there is specific ordering needed.
> > 
> > For writes, the ordering is:
> > 
> > 	1. pre-write data compression - requires data copy
> > 	2. pre-write data encryption - requires data copy
> > 	3. pre-write data checksums - data read only
> > 	4. write the data
> > 	5. post-write metadata updates
> > 
> > We cannot usefully perform compression after encryption -
> > random data doesn't compress - and the checksum must match what is
> > written to disk, so it has to come after all other transformations
> > have been done.
> > 
> > For reads, the order is:
> > 
> > 	1. read the data
> > 	2. verify the data checksum
> > 	3. decrypt the data - requires data copy
> > 	4. decompress the data - requires data copy
> > 	5. place the plain text data in the page cache
> 
> Just random stuffs for for reference, currently fsverity makes
> markle tree for the plain text,

Well, that is specifically an existing implementation detail -
the fsverity core does not care what data is asked to measure as long
as it is the same data that it is asked to verify.

Hence a filesystem could ask fsverity to measure compressed,
encrypted data, and as long as the filesystem also asks fsverity to
measure the same compressed, encrypted data as it is read from disk
it will work as expected.

We could do this quite easily - hand the compressed data record
to fsverity one fsblock sized chunk at a time, and treat the empty
regions between the end of the compressed record and the offset
of the start of the next compressed record as a hole....

So, yeah, I think that fsverity can be placed at the at the "verify
data on disk" layer successfully rather than at the "verify plain
text" layer without actually impacting on it's functionality.

....
> > Compression is where using xattrs gets interesting - the xattrs can
> > have a fixed "offset" they blong to, but can store variable sized
> > data records for that offset.
> > 
> > If we say we have a 64kB compression block size, we can store the
> > compressed data for a 64k block entirely in a remote xattr even if
> > compression fails (i.e. we can store the raw data, not the expanded
> > "compressed" data). The remote xattr can store any amount of smaller
> > data, and we map the compressed data directly into the page cache at
> > a high offset. Then decompression can run on the high offset pages
> > with the destination being some other page cache offset....
> 
> but compressed data itself can also be multiple reference (reflink
> likewise), so currently EROFS uses a seperate pseudo inode if it
> decides with physical addresses as indexes.

Sure, but handling shared data extents and breaking of shared
mappings on write is not an iomap/page cache problem - that's a
problem the filesystem block mapping operations that iomap calls
need to handle.

EROFS uses a separate pseudo inode so taht it can share page cache
as well as shared blocks on disk. I don't think that compression
changes that at all - the page cache contents for all those blocks
are still going to be identical...

As for the case of shared compressed data extents in XFS, I think
that shared status just needs a shared bit to added to the remote
xattr extent record header. Nothing else will really have to change,
because xattr record overwrites are naturally copy-on-write. Hence
updating a record will always break sharing, and the "shared bit"
simply propagates into the block freeing operation to indicate a
refcount update for the blocks being freed is needed. I don't see
supporting FICLONE on compressed inodes as a major issue.

> > On the write side, compression can be done directly into the high
> > offset page cache range for that 64kb offset range, then we can
> > map that to a remote xattr block and write the xattr. The xattr
> > naturally handles variable size blocks.
> 
> Also different from plain text, each compression fses may keep
> different encoded data forms (e.g. fses could add headers or
> trailers to the on-disk compressed data or add more informations
> to extent metadata) for their own needs.i

Sure, but that's something that the filesystem can add when encoding
the data into the page cache. iomap treats the contents of the page
caceh as entirely opaque - how "transformed" data is encoded in the
destination folios is completely up to the filesystem doing the
transformation. All iomap needs to care about is the offset and
length of the opaque transformed data the filesystem needs to reside
in the cache to perform the transformation.

i.e. The example I gave above for XFS compression doesn't need
metadata in the page cache data because it is held in an externally
indexed xattr btree record. That's an XFS compression implementation
detail, not an iomap concern.

-Dave.

Gao Xiang Oct. 11, 2024, 5:19 a.m. UTC | #7

On 2024/10/11 12:52, Dave Chinner wrote:
> [FYI, your email got classified as spam by gmail...]

(I know.. yet that is the only permitted way to send email at work..)

> 
> On Fri, Oct 11, 2024 at 11:28:42AM +0800, Gao Xiang wrote:
>> Hi Dave,
>>
>> On 2024/10/11 08:43, Dave Chinner wrote:
>>> On Thu, Oct 10, 2024 at 02:10:25PM -0400, Goldwyn Rodrigues wrote:
>>
>> ...
>>
>>>
>>> .... there is specific ordering needed.
>>>
>>> For writes, the ordering is:
>>>
>>> 	1. pre-write data compression - requires data copy
>>> 	2. pre-write data encryption - requires data copy
>>> 	3. pre-write data checksums - data read only
>>> 	4. write the data
>>> 	5. post-write metadata updates
>>>
>>> We cannot usefully perform compression after encryption -
>>> random data doesn't compress - and the checksum must match what is
>>> written to disk, so it has to come after all other transformations
>>> have been done.
>>>
>>> For reads, the order is:
>>>
>>> 	1. read the data
>>> 	2. verify the data checksum
>>> 	3. decrypt the data - requires data copy
>>> 	4. decompress the data - requires data copy
>>> 	5. place the plain text data in the page cache
>>
>> Just random stuffs for for reference, currently fsverity makes
>> markle tree for the plain text,
> 
> Well, that is specifically an existing implementation detail -
> the fsverity core does not care what data is asked to measure as long
> as it is the same data that it is asked to verify.
> 
> Hence a filesystem could ask fsverity to measure compressed,
> encrypted data, and as long as the filesystem also asks fsverity to
> measure the same compressed, encrypted data as it is read from disk
> it will work as expected.
> 
> We could do this quite easily - hand the compressed data record
> to fsverity one fsblock sized chunk at a time, and treat the empty
> regions between the end of the compressed record and the offset
> of the start of the next compressed record as a hole....

.. honestly I'm not quite sure that is an implementation detail,
for example, currently userspace can get the root hash digest of
files to check the identical files, such as the same data A:
   A + LZ4 = A1
   A + DEFLATE = A2
   A + Zstd = A3
All three files will have the same root digest for the current
fsverity use cases, but if merkle trees are applied to transformed
data, that will be difference and might not meet some users' use
cases anyway.

> 
> So, yeah, I think that fsverity can be placed at the at the "verify
> data on disk" layer successfully rather than at the "verify plain
> text" layer without actually impacting on it's functionality.
> 
> ....
>>> Compression is where using xattrs gets interesting - the xattrs can
>>> have a fixed "offset" they blong to, but can store variable sized
>>> data records for that offset.
>>>
>>> If we say we have a 64kB compression block size, we can store the
>>> compressed data for a 64k block entirely in a remote xattr even if
>>> compression fails (i.e. we can store the raw data, not the expanded
>>> "compressed" data). The remote xattr can store any amount of smaller
>>> data, and we map the compressed data directly into the page cache at
>>> a high offset. Then decompression can run on the high offset pages
>>> with the destination being some other page cache offset....
>>
>> but compressed data itself can also be multiple reference (reflink
>> likewise), so currently EROFS uses a seperate pseudo inode if it
>> decides with physical addresses as indexes.
> 
> Sure, but handling shared data extents and breaking of shared
> mappings on write is not an iomap/page cache problem - that's a
> problem the filesystem block mapping operations that iomap calls
> need to handle.
> 
> EROFS uses a separate pseudo inode so taht it can share page cache
> as well as shared blocks on disk. I don't think that compression
> changes that at all - the page cache contents for all those blocks
> are still going to be identical...
> 
> As for the case of shared compressed data extents in XFS, I think
> that shared status just needs a shared bit to added to the remote
> xattr extent record header. Nothing else will really have to change,
> because xattr record overwrites are naturally copy-on-write. Hence
> updating a record will always break sharing, and the "shared bit"
> simply propagates into the block freeing operation to indicate a
> refcount update for the blocks being freed is needed. I don't see
> supporting FICLONE on compressed inodes as a major issue.

Yes, I agree for XFS on-disk format it's quite easy.  My comment
related to a minor runtime point: "compressed data directly into
the page cache at a high offset".

That is if a separate pseudo inode is used to contain cached
compressed data, it will take the only one copy and one I/O for
shared compressed data if cache decompression is used..  Anyway,
that is XFS's proposal, so that was my minor comment through.

> 
>>> On the write side, compression can be done directly into the high
>>> offset page cache range for that 64kb offset range, then we can
>>> map that to a remote xattr block and write the xattr. The xattr
>>> naturally handles variable size blocks.
>>
>> Also different from plain text, each compression fses may keep
>> different encoded data forms (e.g. fses could add headers or
>> trailers to the on-disk compressed data or add more informations
>> to extent metadata) for their own needs.i
> 
> Sure, but that's something that the filesystem can add when encoding
> the data into the page cache. iomap treats the contents of the page
> caceh as entirely opaque - how "transformed" data is encoded in the
> destination folios is completely up to the filesystem doing the
> transformation. All iomap needs to care about is the offset and
> length of the opaque transformed data the filesystem needs to reside
> in the cache to perform the transformation.
> 
> i.e. The example I gave above for XFS compression doesn't need
> metadata in the page cache data because it is held in an externally
> indexed xattr btree record. That's an XFS compression implementation
> detail, not an iomap concern.

Got it.

Thanks,
Gao Xiang

> 
> -Dave.

[06/12] iomap: Introduce read_inline() function hook

Commit Message

Comments

Patch