[18/26] xfs: use merkle tree offset as attr hash

Message ID 171444680671.957659.2149857258719599236.stgit@frogsfrogsfrogs (mailing list archive)
State New
Series [01/26] xfs: use unsigned ints for non-negative quantities in xfs_attr_remote.c

Commit Message

Darrick J. Wong April 30, 2024, 3:28 a.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

I was exploring the fsverity metadata with xfs_db after creating a 220MB
verity file, and I noticed the following in the debugger output:

entries[0-75] = [hashval,nameidx,incomplete,root,secure,local,parent,verity]
0:[0,4076,0,0,0,0,0,1]
1:[0,1472,0,0,0,1,0,1]
2:[0x800,4056,0,0,0,0,0,1]
3:[0x800,4036,0,0,0,0,0,1]
...
72:[0x12000,2716,0,0,0,0,0,1]
73:[0x12000,2696,0,0,0,0,0,1]
74:[0x12800,2676,0,0,0,0,0,1]
75:[0x12800,2656,0,0,0,0,0,1]
...
nvlist[0].merkle_off = 0x18000
nvlist[1].merkle_off = 0
nvlist[2].merkle_off = 0x19000
nvlist[3].merkle_off = 0x1000
...
nvlist[71].merkle_off = 0x5b000
nvlist[72].merkle_off = 0x44000
nvlist[73].merkle_off = 0x5c000
nvlist[74].merkle_off = 0x45000
nvlist[75].merkle_off = 0x5d000

Within just this attr leaf block, there are 76 attr entries, but only 38
distinct hash values.  There are 415 merkle tree blocks for this file,
but we already have hash collisions.  The standard da hash function
performs poorly here because we're mostly shifting and rolling zeroes
around.

However, we don't even have to do that much work -- the merkle tree
block keys are themselves u64 values.  Truncate that value to 32 bits
(the size of xfs_dahash_t) and use that for the hash.  We won't have any
collisions between merkle tree blocks until that tree grows to 2^32
blocks.  On a 4k block filesystem, we won't hit that unless the file
contains more than 2^49 bytes, assuming sha256.

As a side effect, the keys for merkle tree blocks get written out in
roughly sequential order, though I didn't observe any change in
performance.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/xfs/libxfs/xfs_attr.c      |    2 ++
 fs/xfs/libxfs/xfs_da_format.h |    6 ++++++
 fs/xfs/libxfs/xfs_verity.c    |   16 ++++++++++++++++
 fs/xfs/libxfs/xfs_verity.h    |    1 +
 4 files changed, 25 insertions(+)

Comments

Christoph Hellwig May 1, 2024, 6:53 a.m. UTC | #1
On Mon, Apr 29, 2024 at 08:28:48PM -0700, Darrick J. Wong wrote:
> Within just this attr leaf block, there are 76 attr entries, but only 38
> distinct hash values.  There are 415 merkle tree blocks for this file,
> but we already have hash collisions.  The standard da hash function
> performs poorly here because we're mostly shifting and rolling zeroes
> around.
> 
> However, we don't even have to do that much work -- the merkle tree
> block keys are themselves u64 values.  Truncate that value to 32 bits
> (the size of xfs_dahash_t) and use that for the hash.  We won't have any
> collisions between merkle tree blocks until that tree grows to 2^32
> blocks.  On a 4k block filesystem, we won't hit that unless the file
> contains more than 2^49 bytes, assuming sha256.
> 
> As a side effect, the keys for merkle tree blocks get written out in
> roughly sequential order, though I didn't observe any change in
> performance.

This and the header hacks suggest to me that shoehorning the fsverity
blocks into attrs just feels like the wrong approach.

They don't really behave like attrs: they aren't separate key/value
pairs, but a large number of same-sized blocks with logical indexing.
All that is actually nicely solved by the original fsverity format used
by ext4/f2fs, while we have to pile workarounds on top of workarounds
to make attrs work.
Christoph Hellwig May 1, 2024, 7:23 a.m. UTC | #2
On Tue, Apr 30, 2024 at 11:53:00PM -0700, Christoph Hellwig wrote:
> This and the header hacks suggest to me that shoehorning the fsverity
> blocks into attrs just feels like the wrong approach.
> 
> They don't really behave like attrs: they aren't separate key/value
> pairs, but a large number of same-sized blocks with logical indexing.
> All that is actually nicely solved by the original fsverity format used
> by ext4/f2fs, while we have to pile workarounds on top of workarounds
> to make attrs work.

Taking this a bit further:  If we want to avoid the problems associated
with the original scheme, mostly the file size limitation and the (IMHO
more cosmetic than real) confusion with post-EOF preallocations, we can
still store the data in the attr fork, but not in the traditional attr
format.  The attr fork provides the same logical-to-physical index
translation as the data fork, and while that is currently only used for
dabtree blocks and remote attr values, that isn't actually a fundamental
requirement for using it.

All the attr fork placement works through xfs_bmap_first_unused() to
find completely random free space in the logical address space.

Now if we reserved, say, the high bit for verity blocks in verity
enabled file systems, we can simply use the bmap btree to do the
mapping from the verity index to the on-disk verity blocks without any
other impact to the attr code.
Darrick J. Wong May 7, 2024, 9:24 p.m. UTC | #3
On Wed, May 01, 2024 at 12:23:02AM -0700, Christoph Hellwig wrote:
> On Tue, Apr 30, 2024 at 11:53:00PM -0700, Christoph Hellwig wrote:
> > This and the header hacks suggest to me that shoehorning the fsverity
> > blocks into attrs just feels like the wrong approach.
> > 
> > They don't really behave like attrs: they aren't separate key/value
> > pairs, but a large number of same-sized blocks with logical indexing.
> > All that is actually nicely solved by the original fsverity format used
> > by ext4/f2fs, while we have to pile workarounds on top of workarounds
> > to make attrs work.
> 
> Taking this a bit further:  If we want to avoid the problems associated
> with the original scheme, mostly the file size limitation and the (IMHO
> more cosmetic than real) confusion with post-EOF preallocations, we can
> still store the data in the attr fork, but not in the traditional attr
> format.  The attr fork provides the same logical-to-physical index
> translation as the data fork, and while that is currently only used for
> dabtree blocks and remote attr values, that isn't actually a fundamental
> requirement for using it.
> 
> All the attr fork placement works through xfs_bmap_first_unused() to
> find completely random free space in the logical address space.
> 
> Now if we reserved, say, the high bit for verity blocks in verity
> enabled file systems, we can simply use the bmap btree to do the
> mapping from the verity index to the on-disk verity blocks without any
> other impact to the attr code.

Since we know the size of the merkle data ahead of time, we could also
preallocate space in the attr fork and create a remote ATTR_VERITY xattr
named "merkle" that points to the allocated space.  Then we don't have
to have magic meanings for the high bit.

Though I guess the question is, given the format:

struct xfs_attr_leaf_name_remote {
	__be32	valueblk;		/* block number of value bytes */
	__be32	valuelen;		/* number of bytes in value */
	__u8	namelen;		/* length of name bytes */
	/*
	 * In Linux 6.5 this flex array was converted from name[1] to name[].
	 * Be very careful here about extra padding at the end; see
	 * xfs_attr_leaf_entsize_remote() for details.
	 */
	__u8	name[];			/* name bytes */
};

Will we ever have a merkle tree larger than 2^32-1 bytes in length?  If
that's possible, then either we shard the merkle tree, or we have to rev
the ondisk xfs_attr_leaf_name_remote structure.

I think we have to rev the format anyway, since with nrext64==1 we can
have attr fork extents that start above 2^32 blocks, and the codebase
will blindly truncate the 64-bit quantity returned by
xfs_bmap_first_unused().

--D
Christoph Hellwig May 8, 2024, 11:47 a.m. UTC | #4
On Tue, May 07, 2024 at 02:24:54PM -0700, Darrick J. Wong wrote:
> Since we know the size of the merkle data ahead of time, we could also
> preallocate space in the attr fork and create a remote ATTR_VERITY xattr
> named "merkle" that points to the allocated space.  Then we don't have
> to have magic meanings for the high bit.

Note that the high bit was just an example; a random high offset might
be a better choice, sized with some space to spare for the maximum
verity data.

> Will we ever have a merkle tree larger than 2^32-1 bytes in length?  If
> that's possible, then either we shard the merkle tree, or we have to rev
> the ondisk xfs_attr_leaf_name_remote structure.

If we did that, it would be yet another indicator that they aren't
attrs but something else.  But maybe I should stop banging that drum and
agree that everything is a nail if all you've got is a hammer.. :)

> 
> I think we have to rev the format anyway, since with nrext64==1 we can
> have attr fork extents that start above 2^32 blocks, and the codebase
> will blindly truncate the 64-bit quantity returned by
> xfs_bmap_first_unused.

Or we decide the space above 2^32 blocks can't be used by attrs, and
only by other users with other means of discovery.  Say, the verity
hashes.
Darrick J. Wong May 8, 2024, 8:26 p.m. UTC | #5
On Wed, May 08, 2024 at 04:47:32AM -0700, Christoph Hellwig wrote:
> On Tue, May 07, 2024 at 02:24:54PM -0700, Darrick J. Wong wrote:
> > Since we know the size of the merkle data ahead of time, we could also
> > preallocate space in the attr fork and create a remote ATTR_VERITY xattr
> > named "merkle" that points to the allocated space.  Then we don't have
> > to have magic meanings for the high bit.
> 
> Note that the high bit was just an example; a random high offset might
> be a better choice, sized with some space to spare for the maximum
> verity data.

I guess we could make it really obvious by allocating range in the
mapping starting at MAX_FILEOFF and going downwards.  Chances are pretty
good that with the xattr info growing upwards they're never going to
meet.

> > Will we ever have a merkle tree larger than 2^32-1 bytes in length?  If
> > that's possible, then either we shard the merkle tree, or we have to rev
> > the ondisk xfs_attr_leaf_name_remote structure.
> 
> If we did that, it would be yet another indicator that they aren't
> attrs but something else.  But maybe I should stop banging that drum and
> agree that everything is a nail if all you've got is a hammer.. :)

Hammer?  All I've got is a big block of cheese. :P

FWIW the fsverity code seems to cut us off at U32_MAX bytes of merkle
data so that's going to be the limit until they rev the ondisk format.

> > I think we have to rev the format anyway, since with nrext64==1 we can
> > have attr fork extents that start above 2^32 blocks, and the codebase
> > will blindly truncate the 64-bit quantity returned by
> > xfs_bmap_first_unused.
> 
> Or we decide the space above 2^32 blocks can't be used by attrs, and
> only by other users with other means of discovery.  Say, the verity
> hashes.

Well right now they can't be used by attrs because xfs_dablk_t isn't big
enough to fit a larger value.  The dangerous part here is that the code
silently truncates the outparam of xfs_bmap_first_unused, so I'll fix
that too.

--D
Christoph Hellwig May 9, 2024, 5:02 a.m. UTC | #6
On Wed, May 08, 2024 at 01:26:03PM -0700, Darrick J. Wong wrote:
> I guess we could make it really obvious by allocating range in the
> mapping starting at MAX_FILEOFF and going downwards.  Chances are pretty
> good that with the xattr info growing upwards they're never going to
> meet.

Yes, although I'd avoid taking chances.  More below.

> > Or we decide the space above 2^32 blocks can't be used by attrs, and
> > only by other users with other means of discovery.  Say, the verity
> > hashes.
> 
> Well right now they can't be used by attrs because xfs_dablk_t isn't big
> enough to fit a larger value.

Yes.

> The dangerous part here is that the code
> silently truncates the outparam of xfs_bmap_first_unused, so I'll fix
> that too.

Well, we should check for that in xfs_attr_rmt_find_hole /
xfs_da_grow_inode_int, totally independent of the fsverity work.
The condition is basically impossible to hit right now, but I'd rather
make sure we do have a solid check.  I'll prepare a patch for it.
Eric Biggers May 9, 2024, 5:46 p.m. UTC | #7
On Wed, May 08, 2024 at 01:26:03PM -0700, Darrick J. Wong wrote:
> > If we did that, it would be yet another indicator that they aren't
> > attrs but something else.  But maybe I should stop banging that drum and
> > agree that everything is a nail if all you've got is a hammer.. :)
> 
> Hammer?  All I've got is a big block of cheese. :P
> 
> FWIW the fsverity code seems to cut us off at U32_MAX bytes of merkle
> data so that's going to be the limit until they rev the ondisk format.
> 

Where does that happen?

- Eric
Darrick J. Wong May 9, 2024, 6:04 p.m. UTC | #8
On Thu, May 09, 2024 at 10:46:52AM -0700, Eric Biggers wrote:
> On Wed, May 08, 2024 at 01:26:03PM -0700, Darrick J. Wong wrote:
> > > If we did that, it would be yet another indicator that they aren't
> > > attrs but something else.  But maybe I should stop banging that drum
> > > and agree that everything is a nail if all you've got is a hammer.. :)
> > 
> > Hammer?  All I've got is a big block of cheese. :P
> > 
> > FWIW the fsverity code seems to cut us off at U32_MAX bytes of merkle
> > data so that's going to be the limit until they rev the ondisk format.
> > 
> 
> Where does that happen?

fsverity_init_merkle_tree_params has the following:

	/*
	 * With block_size != PAGE_SIZE, an in-memory bitmap will need to be
	 * allocated to track the "verified" status of hash blocks.  Don't allow
	 * this bitmap to get too large.  For now, limit it to 1 MiB, which
	 * limits the file size to about 4.4 TB with SHA-256 and 4K blocks.
	 *
	 * Together with the fact that the data, and thus also the Merkle tree,
	 * cannot have more than ULONG_MAX pages, this implies that hash block
	 * indices can always fit in an 'unsigned long'.  But to be safe, we
	 * explicitly check for that too.  Note, this is only for hash block
	 * indices; data block indices might not fit in an 'unsigned long'.
	 */
	if ((params->block_size != PAGE_SIZE && offset > 1 << 23) ||
	    offset > ULONG_MAX) {
		fsverity_err(inode, "Too many blocks in Merkle tree");
		err = -EFBIG;
		goto out_err;
	}

Hmm.  I didn't read this correctly -- the comment says ULONG_MAX pages,
not bytes.  I got confused by the units of @offset, because "u64"
doesn't really help me distinguish bytes, blocks, or pages. :(

OTOH looking at how @offset is computed, it seems to be the total number
of blocks in the merkle tree by the time we get here?

So I guess we actually /can/ create a very large (e.g. 2^33 blocks)
merkle tree on a 64-bit machine, which could then return -EFBIG on
32-bit?

My dumb btree geometry calculator seems to think that an 8EiB file with
a sha256 hash in 4k blocks would generate a 69,260,574,978MB merkle
tree, or roughly a 2^44 block merkle tree?

Ok I guess xfs fsverity really does need a substantial amount of attr
fork space then. :)

--D

> - Eric
>
Eric Biggers May 9, 2024, 6:36 p.m. UTC | #9
On Thu, May 09, 2024 at 11:04:27AM -0700, Darrick J. Wong wrote:
> On Thu, May 09, 2024 at 10:46:52AM -0700, Eric Biggers wrote:
> > On Wed, May 08, 2024 at 01:26:03PM -0700, Darrick J. Wong wrote:
> > > > If we did that, it would be yet another indicator that they aren't
> > > > attrs but something else.  But maybe I should stop banging that drum
> > > > and agree that everything is a nail if all you've got is a hammer.. :)
> > > 
> > > Hammer?  All I've got is a big block of cheese. :P
> > > 
> > > FWIW the fsverity code seems to cut us off at U32_MAX bytes of merkle
> > > data so that's going to be the limit until they rev the ondisk format.
> > > 
> > 
> > Where does that happen?
> 
> fsverity_init_merkle_tree_params has the following:
> 
> 	/*
> 	 * With block_size != PAGE_SIZE, an in-memory bitmap will need to be
> 	 * allocated to track the "verified" status of hash blocks.  Don't allow
> 	 * this bitmap to get too large.  For now, limit it to 1 MiB, which
> 	 * limits the file size to about 4.4 TB with SHA-256 and 4K blocks.
> 	 *
> 	 * Together with the fact that the data, and thus also the Merkle tree,
> 	 * cannot have more than ULONG_MAX pages, this implies that hash block
> 	 * indices can always fit in an 'unsigned long'.  But to be safe, we
> 	 * explicitly check for that too.  Note, this is only for hash block
> 	 * indices; data block indices might not fit in an 'unsigned long'.
> 	 */
> 	if ((params->block_size != PAGE_SIZE && offset > 1 << 23) ||
> 	    offset > ULONG_MAX) {
> 		fsverity_err(inode, "Too many blocks in Merkle tree");
> 		err = -EFBIG;
> 		goto out_err;
> 	}
> 
> Hmm.  I didn't read this correctly -- the comment says ULONG_MAX pages,
> not bytes.  I got confused by the units of @offset, because "u64"
> doesn't really help me distinguish bytes, blocks, or pages. :(
> 
> OTOH looking at how @offset is computed, it seems to be the total number
> of blocks in the merkle tree by the time we get here?

Yes, it's blocks here.

> So I guess we actually /can/ create a very large (e.g. 2^33 blocks)
> merkle tree on a 64-bit machine, which could then return -EFBIG on
> 32-bit?

Sure, but the page cache is indexed with unsigned long, and there are more data
pages than Merkle tree blocks, so that becomes a problem first.  That's why
fs/verity/ uses unsigned long for Merkle tree block indices.

> My dumb btree geometry calculator seems to think that an 8EiB file with
> a sha256 hash in 4k blocks would generate a 69,260,574,978MB merkle
> tree, or roughly a 2^44 block merkle tree?
> 
> Ok I guess xfs fsverity really does need a substantial amount of attr
> fork space then. :)

- Eric
Darrick J. Wong May 9, 2024, 8:02 p.m. UTC | #10
On Wed, May 08, 2024 at 10:02:52PM -0700, Christoph Hellwig wrote:
> On Wed, May 08, 2024 at 01:26:03PM -0700, Darrick J. Wong wrote:
> > I guess we could make it really obvious by allocating range in the
> > mapping starting at MAX_FILEOFF and going downwards.  Chances are pretty
> > good that with the xattr info growing upwards they're never going to
> > meet.
> 
> Yes, although I'd avoid taking chances.  More below.
> 
> > > Or we decide the space above 2^32 blocks can't be used by attrs, and
> > > only by other users with other means of discovery.  Say, the verity
> > > hashes.
> > 
> > Well right now they can't be used by attrs because xfs_dablk_t isn't big
> > enough to fit a larger value.
> 
> Yes.

Thinking about this further, I think building the merkle tree becomes a
lot more difficult than the current design.  At first I thought of
reserving a static partition in the attr fork address range, but got
bogged down in figuring out how big the static partition has to be.

Downthread I realized that the maximum size of a merkle tree is actually
ULONG_MAX blocks, which means that on a 64-bit machine there effectively
is no limit.

For an 8EB file with sha256 hashes (32 bytes) and 4k blocks, we need
2^(63-12) hashes.  Assuming a maximum loading factor of 128 hashes per
block, btheight spits out:

# xfs_db /dev/sdf -c 'btheight -b 4096 merkle_sha256 -n '$(( 2 ** (63 - 12) ))
merkle_sha256: best case per 4096-byte block: 128 records (leaf) / 128 keyptrs (node)
level 0: 2251799813685248 records, 17592186044416 blocks
level 1: 17592186044416 records, 137438953472 blocks
level 2: 137438953472 records, 1073741824 blocks
level 3: 1073741824 records, 8388608 blocks
level 4: 8388608 records, 65536 blocks
level 5: 65536 records, 512 blocks
level 6: 512 records, 4 blocks
level 7: 4 records, 1 block
8 levels, 17730707194373 blocks total

That's about 2^45 blocks.  If the hash is sha512, double those figures.
For a 1k block size, we get:

# xfs_db /dev/sdf -c 'btheight -b 1024 merkle_sha256 -n '$(( 2 ** (63 - 10) ))
merkle_sha256: best case per 1024-byte block: 32 records (leaf) / 32 keyptrs (node)
level 0: 9007199254740992 records, 281474976710656 blocks
level 1: 281474976710656 records, 8796093022208 blocks
level 2: 8796093022208 records, 274877906944 blocks
level 3: 274877906944 records, 8589934592 blocks
level 4: 8589934592 records, 268435456 blocks
level 5: 268435456 records, 8388608 blocks
level 6: 8388608 records, 262144 blocks
level 7: 262144 records, 8192 blocks
level 8: 8192 records, 256 blocks
level 9: 256 records, 8 blocks
level 10: 8 records, 1 block
11 levels, 290554814669065 blocks total

That would be 2^49 blocks but mercifully fsverity doesn't allow more
than 8 levels of tree.

So I don't think it's a good idea to create a hardcoded partition in the
attr fork for merkle tree data, since it would have to be absurdly large
for the common case of sub-1T files:

# xfs_db /dev/sdf -c 'btheight -b 4096 merkle_sha512 -n '$(( 2 ** (40 - 12) ))
merkle_sha512: best case per 4096-byte block: 64 records (leaf) / 64 keyptrs (node)
level 0: 268435456 records, 4194304 blocks
level 1: 4194304 records, 65536 blocks
level 2: 65536 records, 1024 blocks
level 3: 1024 records, 16 blocks
level 4: 16 records, 1 block
5 levels, 4260881 blocks total

That led me to the idea of dynamic partitioning, where we find a sparse
part of the attr fork fileoff range and use that.  That burns a lot less
address range but means that we cannot elide merkle tree blocks that
contain entirely hash(zeroes) because elided blocks become sparse holes
in the attr fork, and xfs_bmap_first_unused can still find those holes.
I guess you could mark those blocks as unwritten, but that wastes space.

Setting even /that/ aside, how would we allocate/map the range?  I think
we'd want a second ATTR_VERITY attr to claim ownership of whatever attr
fork range we found.  xfs_fsverity_write_merkle would have to do
something like this, pretending that the merkle tree blocksize matches
the fs blocksize:

	offset = /* merkle tree pos to attr fork xfs_fileoff_t */
	xfs_trans_alloc()

	xfs_bmapi_write(..., offset, 1...);

	xfs_trans_buf_get()
	/* copy merkle tree contents */
	xfs_trans_log_buf()

	/* update verity extent attr value */
	xfs_attr_defer_add("verity.merkle",
			{fileoff: /* start of range */,
			 blockcount: /* blocks mapped so far */
			});

	xfs_trans_commit()

Note that xfs_fsverity_begin_enable would have to ensure that there's an
attr fork and that it's not in local format.  On the plus side, doing
all this transactionally means we have a second user of logged xattr
updates. :P

Online repair would need to grow new code to copy the merkle tree.

Tearing down the merkle tree (aka if tree setup fails and someone wants
to try again) would use the "verity.merkle" attr to figure out which
blocks to clobber.

Caching at least is pretty easy: look up the "verity.merkle" attribute
to find the fileoff, compute the fileoff of the particular block we want
in the attr fork, xfs_buf_read the buffer, and toss the contents into
the incore cache that we have now.

<shrug> What do you think of that?

--D

> > The dangerous part here is that the code
> > silently truncates the outparam of xfs_bmap_first_unused, so I'll fix
> > that too.
> 
> Well, we should check for that in xfs_attr_rmt_find_hole /
> xfs_da_grow_inode_int, totally independent of the fsverity work.
> The condition is basically impossible to hit right now, but I'd rather
> make sure we do have a solid check.  I'll prepare a patch for it.
> 
>
Christoph Hellwig May 10, 2024, 5:08 a.m. UTC | #11
On Thu, May 09, 2024 at 01:02:50PM -0700, Darrick J. Wong wrote:
> Thinking about this further, I think building the merkle tree becomes a
> lot more difficult than the current design.  At first I thought of
> reserving a static partition in the attr fork address range, but got
> bogged down in figuring out how big the static partition has to be.
> 
> Downthread I realized that the maximum size of a merkle tree is actually
> ULONG_MAX blocks, which means that on a 64-bit machine there effectively
> is no limit.

Do we care about using up the limit?  Remember that in ext4/f2fs the
merkle tree is stored in what XFS calls the data fork, so the file data
plus the merkle tree have to fit into the size limit, be that S64_MAX or
a lower limit imposed by the page cache.  And besides being the limit
imposed by the current most common implementation (I haven't checked
btrfs as the only other one), that does seem like a pretty reasonable
one.

> That led me to the idea of dynamic partitioning, where we find a sparse
> part of the attr fork fileoff range and use that.  That burns a lot less
> address range but means that we cannot elide merkle tree blocks that
> contain entirely hash(zeroes) because elided blocks become sparse holes
> in the attr fork, and xfs_bmap_first_unused can still find those holes.

xfs_bmap_first_unused currently finds them.  It should not, as its
callers are limited to 32-bit addressing.  I'll send a patch to make
that clear.

> Setting even /that/ aside, how would we allocate/map the range?

IFF we stick to a static range (which I think still makes sense), that
range would be statically reserved and should exist if the VERITY bit
is set on the inode, and the size is calculated from the file size.  If
not, we'd indeed need to record the mapping somewhere, and an attr
would be the right place.  It still feels like going down a rabbit hole
for no obvious benefit to me.
Christoph Hellwig May 10, 2024, 6:20 a.m. UTC | #12
FYI, I spent some time looking over the core verity and ext4 code,
and I can't find anything enforcing any kind of size limit.  Of course
testing that is kinda hard without taking sparseness into account.

Eric, should fsverity or the fs backend check for a max size instead of
trying to build the merkle tree and eventually failing to write it out?

An interesting note I found in the ext4 code is:

  Note that the verity metadata *must* be encrypted when the file is,
  since it contains hashes of the plaintext data.

While xfs doesn't currently support fscrypt, it would actually be a
very useful feature, so we're locking ourselves into encrypting attrs
or at least magic attr fork data if we do our own non-standard fsverity
storage.  I'm getting less and less happy with not just doing the
normal post-i_size storage.  Yes, it's not pretty (neither is the whole
fsverity idea of shoehorning the hashes into file systems not built for
it), but it avoids adding tons of code and being very different.
Darrick J. Wong May 17, 2024, 5:17 p.m. UTC | #13
On Thu, May 09, 2024 at 11:20:17PM -0700, Christoph Hellwig wrote:
> FYI, I spent some time looking over the core verity and ext4 code,
> and I can't find anything enforcing any kind of size limit.  Of course
> testing that is kinda hard without taking sparseness into account.
> 
> Eric, should fsverity or the fs backend check for a max size instead
> of trying to build the merkle tree and eventually failing to write it
> out?
> 
> An interesting note I found in the ext4 code is:
> 
>   Note that the verity metadata *must* be encrypted when the file is,
>   since it contains hashes of the plaintext data.

Refresh my memory of fscrypt -- does it encrypt directory names, xattr
names, and xattr values too?  Or does it only do that to file data?

> While xfs doesn't currently support fscrypt, it would actually be a
> very useful feature, so we're locking ourselves into encrypting attrs
> or at least magic attr fork data if we do our own non-standard fsverity
> storage.  I'm getting less and less happy with not just doing the
> normal post-i_size storage.  Yes, it's not pretty (neither is the whole
> fsverity idea of shoehorning the hashes into file systems not built for
> it), but it avoids adding tons of code and being very different.

And if we copy the ext4 method of putting the merkle data after EOF and
loading it into the pagecache, how many of the generic fs/verity cleanup
patches do we really need?

--D
Patch

diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index 953a82d70223e..d21a743f90ea7 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -462,6 +462,8 @@  xfs_attr_hashval(
 
 	if (attr_flags & XFS_ATTR_PARENT)
 		return xfs_parent_hashattr(mp, name, namelen, value, valuelen);
+	if (attr_flags & XFS_ATTR_VERITY)
+		return xfs_verity_hashname(name, namelen);
 
 	return xfs_attr_hashname(name, namelen);
 }
diff --git a/fs/xfs/libxfs/xfs_da_format.h b/fs/xfs/libxfs/xfs_da_format.h
index 43e9d1f00a4ab..c95e8ca22daad 100644
--- a/fs/xfs/libxfs/xfs_da_format.h
+++ b/fs/xfs/libxfs/xfs_da_format.h
@@ -943,4 +943,10 @@  struct xfs_merkle_key {
 #define XFS_VERITY_DESCRIPTOR_NAME	"vdesc"
 #define XFS_VERITY_DESCRIPTOR_NAME_LEN	(sizeof(XFS_VERITY_DESCRIPTOR_NAME) - 1)
 
+/*
+ * Merkle tree blocks cannot be smaller than 1k in size, so the hash function
+ * can right-shift the merkle offset by this amount without losing anything.
+ */
+#define XFS_VERITY_HASH_SHIFT		(10)
+
 #endif /* __XFS_DA_FORMAT_H__ */
diff --git a/fs/xfs/libxfs/xfs_verity.c b/fs/xfs/libxfs/xfs_verity.c
index ff02c5c840b58..8c470014b915c 100644
--- a/fs/xfs/libxfs/xfs_verity.c
+++ b/fs/xfs/libxfs/xfs_verity.c
@@ -56,3 +56,19 @@  xfs_verity_namecheck(
 
 	return true;
 }
+
+/*
+ * Compute name hash for a verity attribute.  For merkle tree blocks, we want
+ * to use the merkle tree block offset as the hash value to avoid collisions
+ * between blocks unless the merkle tree becomes larger than 2^32 blocks.
+ */
+xfs_dahash_t
+xfs_verity_hashname(
+	const uint8_t		*name,
+	unsigned int		namelen)
+{
+	if (namelen != sizeof(struct xfs_merkle_key))
+		return xfs_attr_hashname(name, namelen);
+
+	return xfs_merkle_key_from_disk(name, namelen) >> XFS_VERITY_HASH_SHIFT;
+}
diff --git a/fs/xfs/libxfs/xfs_verity.h b/fs/xfs/libxfs/xfs_verity.h
index 5813665c5a01e..3d7485c511d58 100644
--- a/fs/xfs/libxfs/xfs_verity.h
+++ b/fs/xfs/libxfs/xfs_verity.h
@@ -9,5 +9,6 @@  void xfs_merkle_key_to_disk(struct xfs_merkle_key *key, uint64_t pos);
 uint64_t xfs_merkle_key_from_disk(const void *attr_name, int namelen);
 bool xfs_verity_namecheck(unsigned int attr_flags, const void *name,
 		int namelen);
+xfs_dahash_t xfs_verity_hashname(const uint8_t *name, unsigned int namelen);
 
 #endif	/* __XFS_VERITY_H__ */