Message ID | 171444680671.957659.2149857258719599236.stgit@frogsfrogsfrogs (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [01/26] xfs: use unsigned ints for non-negative quantities in xfs_attr_remote.c | expand |
On Mon, Apr 29, 2024 at 08:28:48PM -0700, Darrick J. Wong wrote: > Within just this attr leaf block, there are 76 attr entries, but only 38 > distinct hash values. There are 415 merkle tree blocks for this file, > but we already have hash collisions. This isn't good performance from > the standard da hash function because we're mostly shifting and rolling > zeroes around. > > However, we don't even have to do that much work -- the merkle tree > block keys are themslves u64 values. Truncate that value to 32 bits > (the size of xfs_dahash_t) and use that for the hash. We won't have any > collisions between merkle tree blocks until that tree grows to 2^32nd > blocks. On a 4k block filesystem, we won't hit that unless the file > contains more than 2^49 bytes, assuming sha256. > > As a side effect, the keys for merkle tree blocks get written out in > roughly sequential order, though I didn't observe any change in > performance. This and the header hacks suggest to me that shoe horning the fsverity blocks into attrs just feels like the wrong approach. They don't really behave like attrs, they aren't key/value paris that are separate, but a large amount of same sized blocks with logical indexing. All that is actually nicely solved by the original fsverity used by ext4/f2fs, while we have to pile workarounds ontop of workarounds to make attrs work.
On Tue, Apr 30, 2024 at 11:53:00PM -0700, Christoph Hellwig wrote: > This and the header hacks suggest to me that shoe horning the fsverity > blocks into attrs just feels like the wrong approach. > > They don't really behave like attrs, they aren't key/value paris that > are separate, but a large amount of same sized blocks with logical > indexing. All that is actually nicely solved by the original fsverity > used by ext4/f2fs, while we have to pile workarounds ontop of > workarounds to make attrs work. Taking this a bit further: If we want to avoid the problems associated with the original scheme, mostly the file size limitation, and the (IMHO more cosmetic than real) confusion with post-EOF preallocations, we can still store the data in the attr fork, but not in the traditional attr format. The attr fork provides the logical index to physical translation as the data fork, and while that is current only used for dabtree blocks and remote attr values, that isn't actually a fundamental requirement for using it. All the attr fork placement works through xfs_bmap_first_unused() to find completely random free space in the logic address space. Now if we reserved say the high bit for verity blocks in verity enabled file systems we can simply use the bmap btree to do the mapping from the verity index to the on-disk verify blocks without any other impact to the attr code.
On Wed, May 01, 2024 at 12:23:02AM -0700, Christoph Hellwig wrote: > On Tue, Apr 30, 2024 at 11:53:00PM -0700, Christoph Hellwig wrote: > > This and the header hacks suggest to me that shoe horning the fsverity > > blocks into attrs just feels like the wrong approach. > > > > They don't really behave like attrs, they aren't key/value paris that > > are separate, but a large amount of same sized blocks with logical > > indexing. All that is actually nicely solved by the original fsverity > > used by ext4/f2fs, while we have to pile workarounds ontop of > > workarounds to make attrs work. > > Taking this a bit further: If we want to avoid the problems associated > with the original scheme, mostly the file size limitation, and the (IMHO > more cosmetic than real) confusion with post-EOF preallocations, we > can still store the data in the attr fork, but not in the traditional > attr format. The attr fork provides the logical index to physical > translation as the data fork, and while that is current only used for > dabtree blocks and remote attr values, that isn't actually a fundamental > requirement for using it. > > All the attr fork placement works through xfs_bmap_first_unused() to > find completely random free space in the logic address space. > > Now if we reserved say the high bit for verity blocks in verity enabled > file systems we can simply use the bmap btree to do the mapping from > the verity index to the on-disk verify blocks without any other impact > to the attr code. Since we know the size of the merkle data ahead of time, we could also preallocate space in the attr fork and create a remote ATTR_VERITY xattr named "merkle" that points to the allocated space. Then we don't have to have magic meanings for the high bit. Though I guess the question is, given the format: struct xfs_attr_leaf_name_remote { __be32 valueblk; /* block number of value bytes */ __be32 valuelen; /* number of bytes in value */ __u8 namelen; /* length of name bytes */ /* * In Linux 6.5 this flex array was converted from name[1] to name[]. * Be very careful here about extra padding at the end; see * xfs_attr_leaf_entsize_remote() for details. */ __u8 name[]; /* name bytes */ }; Will we ever have a merkle tree larger than 2^32-1 bytes in length? If that's possible, then either we shard the merkle tree, or we have to rev the ondisk xfs_attr_leaf_name_remote structure. I think we have to rev the format anyway, since with nrext64==1 we can have attr fork extents that start above 2^32 blocks, and the codebase will blindly truncate the 64-bit quantity returned by xfs_bmap_first_unused. --D
On Tue, May 07, 2024 at 02:24:54PM -0700, Darrick J. Wong wrote: > Since we know the size of the merkle data ahead of time, we could also > preallocate space in the attr fork and create a remote ATTR_VERITY xattr > named "merkle" that points to the allocated space. Then we don't have > to have magic meanings for the high bit. Note that high bit was just an example, a random high offset might be a better choice, sized with some space to spare for the maximum verify data. > Will we ever have a merkle tree larger than 2^32-1 bytes in length? If > that's possible, then either we shard the merkle tree, or we have to rev > the ondisk xfs_attr_leaf_name_remote structure. If we did that would be yet another indicator that they aren't attrs but something else. But maybe I should stop banging that drum and agree that everything is a nail if all you got is a hammer.. :) > > I think we have to rev the format anyway, since with nrext64==1 we can > have attr fork extents that start above 2^32 blocks, and the codebase > will blindly truncate the 64-bit quantity returned by > xfs_bmap_first_unused. Or we decide the space above 2^32 blocks can't be used by attrs, and only by other users with other means of discover. Say the verify hashes..
On Wed, May 08, 2024 at 04:47:32AM -0700, Christoph Hellwig wrote: > On Tue, May 07, 2024 at 02:24:54PM -0700, Darrick J. Wong wrote: > > Since we know the size of the merkle data ahead of time, we could also > > preallocate space in the attr fork and create a remote ATTR_VERITY xattr > > named "merkle" that points to the allocated space. Then we don't have > > to have magic meanings for the high bit. > > Note that high bit was just an example, a random high offset > might be a better choice, sized with some space to spare for the maximum > verify data. I guess we could make it really obvious by allocating range in the mapping starting at MAX_FILEOFF and going downwards. Chances are pretty good that with the xattr info growing upwards they're never going to meet. > > Will we ever have a merkle tree larger than 2^32-1 bytes in length? If > > that's possible, then either we shard the merkle tree, or we have to rev > > the ondisk xfs_attr_leaf_name_remote structure. > > If we did that would be yet another indicator that they aren't attrs > but something else. But maybe I should stop banging that drum and > agree that everything is a nail if all you got is a hammer.. :) Hammer? All I've got is a big block of cheese. :P FWIW the fsverity code seems to cut us off at U32_MAX bytes of merkle data so that's going to be the limit until they rev the ondisk format. > > I think we have to rev the format anyway, since with nrext64==1 we can > > have attr fork extents that start above 2^32 blocks, and the codebase > > will blindly truncate the 64-bit quantity returned by > > xfs_bmap_first_unused. > > Or we decide the space above 2^32 blocks can't be used by attrs, > and only by other users with other means of discover. Say the > verify hashes.. Well right now they can't be used by attrs because xfs_dablk_t isn't big enough to fit a larger value. The dangerous part here is that the code silently truncates the outparam of xfs_bmap_first_unused, so I'll fix that too. --D
On Wed, May 08, 2024 at 01:26:03PM -0700, Darrick J. Wong wrote: > I guess we could make it really obvious by allocating range in the > mapping starting at MAX_FILEOFF and going downwards. Chances are pretty > good that with the xattr info growing upwards they're never going to > meet. Yes, although I'd avoid taking chances. More below. > > Or we decide the space above 2^32 blocks can't be used by attrs, > > and only by other users with other means of discover. Say the > > verify hashes.. > > Well right now they can't be used by attrs because xfs_dablk_t isn't big > enough to fit a larger value. Yes. > The dangerous part here is that the code > silently truncates the outparam of xfs_bmap_first_unused, so I'll fix > that too. Well, we should check for that in xfs_attr_rmt_find_hole / xfs_da_grow_inode_int, totally independent of the fsverity work. The condition is basically impossible to hit right now, but I'd rather make sure we do have a solid check. I'll prepare a patch for it.
On Wed, May 08, 2024 at 01:26:03PM -0700, Darrick J. Wong wrote: > > If we did that would be yet another indicator that they aren't attrs > > but something else. But maybe I should stop banging that drum and > > agree that everything is a nail if all you got is a hammer.. :) > > Hammer? All I've got is a big block of cheese. :P > > FWIW the fsverity code seems to cut us off at U32_MAX bytes of merkle > data so that's going to be the limit until they rev the ondisk format. > Where does that happen? - Eric
On Thu, May 09, 2024 at 10:46:52AM -0700, Eric Biggers wrote: > On Wed, May 08, 2024 at 01:26:03PM -0700, Darrick J. Wong wrote: > > > If we did that would be yet another indicator that they aren't attrs > > > but something else. But maybe I should stop banging that drum and > > > agree that everything is a nail if all you got is a hammer.. :) > > > > Hammer? All I've got is a big block of cheese. :P > > > > FWIW the fsverity code seems to cut us off at U32_MAX bytes of merkle > > data so that's going to be the limit until they rev the ondisk format. > > > > Where does that happen? fsverity_init_merkle_tree_params has the following: /* * With block_size != PAGE_SIZE, an in-memory bitmap will need to be * allocated to track the "verified" status of hash blocks. Don't allow * this bitmap to get too large. For now, limit it to 1 MiB, which * limits the file size to about 4.4 TB with SHA-256 and 4K blocks. * * Together with the fact that the data, and thus also the Merkle tree, * cannot have more than ULONG_MAX pages, this implies that hash block * indices can always fit in an 'unsigned long'. But to be safe, we * explicitly check for that too. Note, this is only for hash block * indices; data block indices might not fit in an 'unsigned long'. */ if ((params->block_size != PAGE_SIZE && offset > 1 << 23) || offset > ULONG_MAX) { fsverity_err(inode, "Too many blocks in Merkle tree"); err = -EFBIG; goto out_err; } Hmm. I didn't read this correctly -- the comment says ULONG_MAX pages, not bytes. I got confused by the units of @offset, because "u64" doesn't really help me distinguish bytes, blocks, or pages. :( OTOH looking at how @offset is computed, it seems to be the total number of blocks in the merkle tree by the time we get here? So I guess we actually /can/ create a very large (e.g. 2^33 blocks) merkle tree on a 64-bit machine, which could then return -EFBIG on 32-bit? My dumb btree geometry calculator seems to think that an 8EiB file with a sha256 hash in 4k blocks would generate a 69,260,574,978MB merkle tree, or roughly a 2^44 block merkle tree? Ok I guess xfs fsverity really does need a substantial amount of attr fork space then. :) --D > - Eric >
On Thu, May 09, 2024 at 11:04:27AM -0700, Darrick J. Wong wrote: > On Thu, May 09, 2024 at 10:46:52AM -0700, Eric Biggers wrote: > > On Wed, May 08, 2024 at 01:26:03PM -0700, Darrick J. Wong wrote: > > > > If we did that would be yet another indicator that they aren't attrs > > > > but something else. But maybe I should stop banging that drum and > > > > agree that everything is a nail if all you got is a hammer.. :) > > > > > > Hammer? All I've got is a big block of cheese. :P > > > > > > FWIW the fsverity code seems to cut us off at U32_MAX bytes of merkle > > > data so that's going to be the limit until they rev the ondisk format. > > > > > > > Where does that happen? > > fsverity_init_merkle_tree_params has the following: > > /* > * With block_size != PAGE_SIZE, an in-memory bitmap will need to be > * allocated to track the "verified" status of hash blocks. Don't allow > * this bitmap to get too large. For now, limit it to 1 MiB, which > * limits the file size to about 4.4 TB with SHA-256 and 4K blocks. > * > * Together with the fact that the data, and thus also the Merkle tree, > * cannot have more than ULONG_MAX pages, this implies that hash block > * indices can always fit in an 'unsigned long'. But to be safe, we > * explicitly check for that too. Note, this is only for hash block > * indices; data block indices might not fit in an 'unsigned long'. > */ > if ((params->block_size != PAGE_SIZE && offset > 1 << 23) || > offset > ULONG_MAX) { > fsverity_err(inode, "Too many blocks in Merkle tree"); > err = -EFBIG; > goto out_err; > } > > Hmm. I didn't read this correctly -- the comment says ULONG_MAX pages, > not bytes. I got confused by the units of @offset, because "u64" > doesn't really help me distinguish bytes, blocks, or pages. :( > > OTOH looking at how @offset is computed, it seems to be the total number > of blocks in the merkle tree by the time we get here? Yes, it's blocks here. > So I guess we actually /can/ create a very large (e.g. 2^33 blocks) > merkle tree on a 64-bit machine, which could then return -EFBIG on > 32-bit? Sure, but the page cache is indexed with unsigned long, and there are more data pages than Merkle tree blocks, so that becomes a problem first. That's why fs/verity/ uses unsigned long for Merkle tree block indices. > My dumb btree geometry calculator seems to think that an 8EiB file with > a sha256 hash in 4k blocks would generate a 69,260,574,978MB merkle > tree, or roughly a 2^44 block merkle tree? > > Ok I guess xfs fsverity really does need a substantial amount of attr > fork space then. :) - Eric
On Wed, May 08, 2024 at 10:02:52PM -0700, Christoph Hellwig wrote: > On Wed, May 08, 2024 at 01:26:03PM -0700, Darrick J. Wong wrote: > > I guess we could make it really obvious by allocating range in the > > mapping starting at MAX_FILEOFF and going downwards. Chances are pretty > > good that with the xattr info growing upwards they're never going to > > meet. > > Yes, although I'd avoid taking chances. More below. > > > > Or we decide the space above 2^32 blocks can't be used by attrs, > > > and only by other users with other means of discover. Say the > > > verify hashes.. > > > > Well right now they can't be used by attrs because xfs_dablk_t isn't big > > enough to fit a larger value. > > Yes. Thinking about this further, I think building the merkle tree becomes a lot more difficult than the current design. At first I thought of reserving a static partition in the attr fork address range, but got bogged donw in figuring out how big the static partition has to be. Downthread I realized that the maximum size of a merkle tree is actually ULONG_MAX blocks, which means that on a 64-bit machine there effectively is no limit. For an 8EB file with sha256 hashes (32 bytes) and 4k blocks, we need 2^(63-12) hashes. Assuming maximum loading factor of 128 hashes per block, btheight spits out: # xfs_db /dev/sdf -c 'btheight -b 4096 merkle_sha256 -n '$(( 2 ** (63 - 12) )) merkle_sha256: best case per 4096-byte block: 128 records (leaf) / 128 keyptrs (node) level 0: 2251799813685248 records, 17592186044416 blocks level 1: 17592186044416 records, 137438953472 blocks level 2: 137438953472 records, 1073741824 blocks level 3: 1073741824 records, 8388608 blocks level 4: 8388608 records, 65536 blocks level 5: 65536 records, 512 blocks level 6: 512 records, 4 blocks level 7: 4 records, 1 block 8 levels, 17730707194373 blocks total That's about 2^45 blocks. If the hash is sha512, double those figures. For a 1k block size, we get: # xfs_db /dev/sdf -c 'btheight -b 1024 merkle_sha256 -n '$(( 2 ** (63 - 10) )) merkle_sha256: best case per 1024-byte block: 32 records (leaf) / 32 keyptrs (node) level 0: 9007199254740992 records, 281474976710656 blocks level 1: 281474976710656 records, 8796093022208 blocks level 2: 8796093022208 records, 274877906944 blocks level 3: 274877906944 records, 8589934592 blocks level 4: 8589934592 records, 268435456 blocks level 5: 268435456 records, 8388608 blocks level 6: 8388608 records, 262144 blocks level 7: 262144 records, 8192 blocks level 8: 8192 records, 256 blocks level 9: 256 records, 8 blocks level 10: 8 records, 1 block 11 levels, 290554814669065 blocks total That would be 2^49 blocks but mercifully fsverity doesn't allow more than 8 levels of tree. So I don't think it's a good idea to create a hardcoded partition in the attr fork for merkle tree data, since it would have to be absurdly large for the common case of sub-1T files: # xfs_db /dev/sdf -c 'btheight -b 4096 merkle_sha512 -n '$(( 2 ** (40 - 12) )) merkle_sha512: best case per 4096-byte block: 64 records (leaf) / 64 keyptrs (node) level 0: 268435456 records, 4194304 blocks level 1: 4194304 records, 65536 blocks level 2: 65536 records, 1024 blocks level 3: 1024 records, 16 blocks level 4: 16 records, 1 block 5 levels, 4260881 blocks total That led me to the idea of dynamic partitioning, where we find a sparse part of the attr fork fileoff range and use that. That burns a lot less address range but means that we cannot elide merkle tree blocks that contain entirely hash(zeroes) because elided blocks become sparse holes in the attr fork, and xfs_bmap_first_unused can still find those holes. I guess you could mark those blocks as unwritten, but that wastes space. Setting even /that/ aside, how would we allocate/map the range? I think we'd want a second ATTR_VERITY attr to claim ownership of whatever attr fork range we found. xfs_fsverity_write_merkle would have to do something like this, pretending that the merkle tree blocksize matches the fs blocksize: offset = /* merkle tree pos to attr fork xfs_fileoff_t */ xfs_trans_alloc() xfs_bmapi_write(..., offset, 1...); xfs_trans_buf_get() /* copy merkle tree contents */ xfs_trans_log_buf() /* update verity extent attr value */ xfs_attr_defer_add("verity.merkle", {fileoff: /* start of range */, blockcount: /* blocks mapped so far */ }); xfs_trans_commit() Note that xfs_fsverity_begin_enable would have to ensure that there's an attr fork and that it's not in local format. On the plus side, doing all this transactionally means we have a second user of logged xattr updates. :P Online repair would need to grow new code to copy the merkle tree. Tearing down the merkle tree (aka if tree setup fails and someone wants to try again) use the "verity.merkle" attr to figure out which blocks to clobber. Caching at least is pretty easy, look up the "verity.merkle" attribute to find the fileoff, compute the fileoff of the particular block we want in the attr fork, xfs_buf_read the buffer, and toss the contents in the incore cache that we have now. <shrug> What do you think of that? --D > > The dangerous part here is that the code > > silently truncates the outparam of xfs_bmap_first_unused, so I'll fix > > that too. > > Well, we should check for that in xfs_attr_rmt_find_hole / > xfs_da_grow_inode_int, totally independent of the fsverity work. > The condition is basically impossible to hit right now, but I'd rather > make sure we do have a solid check. I'll prepare a patch for it. > >
On Thu, May 09, 2024 at 01:02:50PM -0700, Darrick J. Wong wrote: > Thinking about this further, I think building the merkle tree becomes a > lot more difficult than the current design. At first I thought of > reserving a static partition in the attr fork address range, but got > bogged donw in figuring out how big the static partition has to be. > > Downthread I realized that the maximum size of a merkle tree is actually > ULONG_MAX blocks, which means that on a 64-bit machine there effectively > is no limit. Do we care about using up the limit? Remember that in ext4/f2fs the merkle tree is stored in what XFS calls the data fork, so the file data plus the merkle tree have to fit into the size limit, be that S64_MAX or lower limit imposed by the page cache. And besides being the limit imposed by the current most common implementation (I haven't checked btrfs as the only other one), that does seem like a pretty reasonable one. > That led me to the idea of dynamic partitioning, where we find a sparse > part of the attr fork fileoff range and use that. That burns a lot less > address range but means that we cannot elide merkle tree blocks that > contain entirely hash(zeroes) because elided blocks become sparse holes > in the attr fork, and xfs_bmap_first_unused can still find those holes. xfs_bmap_first_unused currently finds them. It should not as it's callers are limited to 32-bit addressing. I'll send a patch to make that clear. > Setting even /that/ aside, how would we allocate/map the range? IFF we stick to a static range (which I think still make sense), that range would be statically reserved and should exist if the VERITY bit is set on the inode, and the size is calculated from the file size. If not we'd indeed need to record the mapping somewhere, and an attr would be the right place. It still feels like going down a rabit hole for no obvious benefit to me.
FYI, I spent some time looking over the core verity and ext4 code, and I can't find anything enforcing any kind of size limit. Of course testing that is kinda hard without taking sparseness into account. Eric, should fsverity or the fs backend check for a max size instead od trying to build the merkle tree and evnetually failing to write it out? An interesting note I found in the ext4 code is: Note that the verity metadata *must* be encrypted when the file is, since it contains hashes of the plaintext data. While xfs doesn't currently support fscrypyt it would actually be very useful feature, so we're locking us into encrypting attrs or at least magic attr fork data if we do our own non-standard fsverity storage. I'm getting less and less happy with not just doing the normal post i_size storage. Yes, it's not pretty (so isn't the whole fsverity idea of shoehorning the hashes into file systems not built for it), but it avoid adding tons of code and beeing very different.
On Thu, May 09, 2024 at 11:20:17PM -0700, Christoph Hellwig wrote: > FYI, I spent some time looking over the core verity and ext4 code, > and I can't find anything enforcing any kind of size limit. Of course > testing that is kinda hard without taking sparseness into account. > > Eric, should fsverity or the fs backend check for a max size instead > od trying to build the merkle tree and evnetually failing to write it > out? > > An interesting note I found in the ext4 code is: > > Note that the verity metadata *must* be encrypted when the file is, > since it contains hashes of the plaintext data. Refresh my memory of fscrypt -- does it encrypt directory names, xattr names, and xattr values too? Or does it only do that to file data? > While xfs doesn't currently support fscrypyt it would actually be very > useful feature, so we're locking us into encrypting attrs or at least > magic attr fork data if we do our own non-standard fsverity storage. > I'm getting less and less happy with not just doing the normal post > i_size storage. Yes, it's not pretty (so isn't the whole fsverity idea > of shoehorning the hashes into file systems not built for it), but it > avoid adding tons of code and beeing very different. And if we copy the ext4 method of putting the merkle data after eof and loading it into the pagecache, how much of the generic fs/verity cleanup patches do we really need? --D
On Fri, May 17, 2024 at 10:17:20AM -0700, Darrick J. Wong wrote: > > Note that the verity metadata *must* be encrypted when the file is, > > since it contains hashes of the plaintext data. > > Refresh my memory of fscrypt -- does it encrypt directory names, xattr > names, and xattr values too? Or does it only do that to file data? It does encrypt the file names in the directories, but nothing in xattrs as far as I can tell. > And if we copy the ext4 method of putting the merkle data after eof and > loading it into the pagecache, how much of the generic fs/verity cleanup > patches do we really need? We shouldn't need anything. A bunch of cleanup and the support for not generating a hash for holes might still be nice to have, though.
On Mon, May 20, 2024 at 05:39:59AM -0700, Christoph Hellwig wrote: > On Fri, May 17, 2024 at 10:17:20AM -0700, Darrick J. Wong wrote: > > > Note that the verity metadata *must* be encrypted when the file is, > > > since it contains hashes of the plaintext data. > > > > Refresh my memory of fscrypt -- does it encrypt directory names, xattr > > names, and xattr values too? Or does it only do that to file data? > > It does encrypt the file names in the directories, but nothing in > xattrs as far as I can tell. Do we want that for user.* attrs? That seems like quite an omission. > > And if we copy the ext4 method of putting the merkle data after eof and > > loading it into the pagecache, how much of the generic fs/verity cleanup > > patches do we really need? > > We shouldn't need anything. A bunch of cleanup Should we do the read/drop_merkle_tree_block cleanup anyway? One of the advantages of xfs caching merkle tree blocks ourselves is that we neither extend the usage of PageChecked when merkle blocksize == pagesize nor become subject to the 1-million merkle block limit when merkle blocksize < pagesize. There's a tripping hazard if you mount a 4k merkle block filesystem on a computer with 64k pages -- now you can't open 6T verity files. That said, it also sounds dumb to maintain a separate index for pagecache pages to track a single bit. Maybe we should port verity to use xbitmap64 from xfs instead of single static buffer? > and the support for not > generating a hash for holes might still be nice to have, though. Yeah, though I think that's only usable for xfs if merkle blocksize == fs blocksize, since xfs doesn't store sub-fsblock unwritten state. --D
On Mon, May 20, 2024 at 09:02:59AM -0700, Darrick J. Wong wrote: > On Mon, May 20, 2024 at 05:39:59AM -0700, Christoph Hellwig wrote: > > On Fri, May 17, 2024 at 10:17:20AM -0700, Darrick J. Wong wrote: > > > > Note that the verity metadata *must* be encrypted when the file is, > > > > since it contains hashes of the plaintext data. > > > > > > Refresh my memory of fscrypt -- does it encrypt directory names, xattr > > > names, and xattr values too? Or does it only do that to file data? > > > > It does encrypt the file names in the directories, but nothing in > > xattrs as far as I can tell. > > Do we want that for user.* attrs? That seems like quite an omission. I'll let Eric answer that. Btw, is the threat model for fscrypt written down somewhere? > > > And if we copy the ext4 method of putting the merkle data after eof and > > > loading it into the pagecache, how much of the generic fs/verity cleanup > > > patches do we really need? > > > > We shouldn't need anything. A bunch of cleanup > > Should we do the read/drop_merkle_tree_block cleanup anyway? To me the block based interface seems a lot cleaner, but Eric has some reservations due to the added indirect call on the drop side. > One of the advantages of xfs caching merkle tree blocks ourselves > is that we neither extend the usage of PageChecked when merkle blocksize > == pagesize nor become subject to the 1-million merkle block limit when > merkle blocksize < pagesize. There's a tripping hazard if you mount a 4k > merkle block filesystem on a computer with 64k pages -- now you can't > open 6T verity files. > > That said, it also sounds dumb to maintain a separate index for > pagecache pages to track a single bit. Yeah. As I mentioned earlier I think fsverify really should enforce a size limit. Right now it will simply run out space eventually which doesn't seem like a nice failure mode. > Maybe we should port verity to > use xbitmap64 from xfs instead of single static buffer? The seems like a bit of overkill for the current use cases.
On Wed, May 22, 2024 at 07:37:23AM -0700, Christoph Hellwig wrote: > On Mon, May 20, 2024 at 09:02:59AM -0700, Darrick J. Wong wrote: > > On Mon, May 20, 2024 at 05:39:59AM -0700, Christoph Hellwig wrote: > > > On Fri, May 17, 2024 at 10:17:20AM -0700, Darrick J. Wong wrote: > > > > > Note that the verity metadata *must* be encrypted when the file is, > > > > > since it contains hashes of the plaintext data. > > > > > > > > Refresh my memory of fscrypt -- does it encrypt directory names, xattr > > > > names, and xattr values too? Or does it only do that to file data? > > > > > > It does encrypt the file names in the directories, but nothing in > > > xattrs as far as I can tell. > > > > Do we want that for user.* attrs? That seems like quite an omission. > > I'll let Eric answer that. Btw, is the threat model for fscrypt written > down somewhere? See https://www.kernel.org/doc/html/latest/filesystems/fscrypt.html?highlight=fscrypt#threat-model As for why it stopped at file contents and names (and fsverity Merkle tree blocks which ext4 and f2fs encrypt in the same way as contents), it's just because it's very difficult to add more, and file contents and names alone were enough for parity with most other file-level encryption systems. That's just the nature of file-level encryption. For each additional type of data or metadata that's encrypted, there are a huge number of things that need to be resolved including algorithm selection, key derivation, IV selection, on-disk format, padding, UAPI for enabling the feature, userspace tool support including fsck and debugging tools, access semantics without the key, etc... xattr encryption is definitely something that people have thought about, and it probably would be the next thing to consider after the existing contents and names. Encrypting the exact file sizes is also something to consider. But it's not something that someone has volunteered to do all the work for (yet). If you restricted it to the contents of user xattrs only (not other xattrs, and not xattr names), it would be more feasible than trying to encrypt the names and values of all xattrs, though it would still be difficult. Of course, generally speaking, when adding fscrypt support to a filesystem, it's going to be much easier to just target the existing feature set, and not try to include new, unproven features too. (FWIW, this also applies to fsverity.) If someone is interested in taking on an experimental project add xattr encryption support, I'd be glad to try to provide guidance, but it probably should be separated out from adding fscrypt support in the first place. > > > > And if we copy the ext4 method of putting the merkle data after eof and > > > > loading it into the pagecache, how much of the generic fs/verity cleanup > > > > patches do we really need? > > > > > > We shouldn't need anything. A bunch of cleanup > > > > Should we do the read/drop_merkle_tree_block cleanup anyway? > > To me the block based interface seems a lot cleaner, but Eric has some > reservations due to the added indirect call on the drop side. The point of the block based interface is really so that filesystems could actually do the block-based caching. If XFS isn't going to end up using that after all, I think we should just stay with ->read_merkle_tree_page for now. > > One of the advantages of xfs caching merkle tree blocks ourselves > > is that we neither extend the usage of PageChecked when merkle blocksize > > == pagesize nor become subject to the 1-million merkle block limit when > > merkle blocksize < pagesize. There's a tripping hazard if you mount a 4k > > merkle block filesystem on a computer with 64k pages -- now you can't > > open 6T verity files. > > > > That said, it also sounds dumb to maintain a separate index for > > pagecache pages to track a single bit. > > Yeah. As I mentioned earlier I think fsverify really should enforce > a size limit. Right now it will simply run out space eventually which > doesn't seem like a nice failure mode. You could enforce the limit of 1 << 23 Merkle tree blocks regardless of whether the block size and page size are equal, if you want. This probably would need to be specific to XFS, though, as it would be a breaking change for other filesystems. As for ext4 and f2fs failing with EFBIG in the middle of building the Merkle tree due to it exceeding s_maxbytes, yes that could be detected at the beginning of building the tree to get the error right away. This just isn't done currently because the current way is simpler, and this case isn't really encountered in practice so there isn't too much reason to optimize it. - Eric
On Wed, May 22, 2024 at 11:29:00AM -0700, Eric Biggers wrote: > On Wed, May 22, 2024 at 07:37:23AM -0700, Christoph Hellwig wrote: > > On Mon, May 20, 2024 at 09:02:59AM -0700, Darrick J. Wong wrote: > > > On Mon, May 20, 2024 at 05:39:59AM -0700, Christoph Hellwig wrote: > > > > On Fri, May 17, 2024 at 10:17:20AM -0700, Darrick J. Wong wrote: > > > > > > Note that the verity metadata *must* be encrypted when the file is, > > > > > > since it contains hashes of the plaintext data. > > > > > > > > > > Refresh my memory of fscrypt -- does it encrypt directory names, xattr > > > > > names, and xattr values too? Or does it only do that to file data? > > > > > > > > It does encrypt the file names in the directories, but nothing in > > > > xattrs as far as I can tell. > > > > > > Do we want that for user.* attrs? That seems like quite an omission. > > > > I'll let Eric answer that. Btw, is the threat model for fscrypt written > > down somewhere? > > See https://www.kernel.org/doc/html/latest/filesystems/fscrypt.html?highlight=fscrypt#threat-model > > As for why it stopped at file contents and names (and fsverity Merkle tree > blocks which ext4 and f2fs encrypt in the same way as contents), it's just > because it's very difficult to add more, and file contents and names alone were > enough for parity with most other file-level encryption systems. That's just > the nature of file-level encryption. For each additional type of data or > metadata that's encrypted, there are a huge number of things that need to be > resolved including algorithm selection, key derivation, IV selection, on-disk > format, padding, UAPI for enabling the feature, userspace tool support including > fsck and debugging tools, access semantics without the key, etc... > > xattr encryption is definitely something that people have thought about, and it > probably would be the next thing to consider after the existing contents and > names. Encrypting the exact file sizes is also something to consider. But it's > not something that someone has volunteered to do all the work for (yet). If you > restricted it to the contents of user xattrs only (not other xattrs, and not > xattr names), it would be more feasible than trying to encrypt the names and > values of all xattrs, though it would still be difficult. > > Of course, generally speaking, when adding fscrypt support to a filesystem, it's > going to be much easier to just target the existing feature set, and not try to > include new, unproven features too. (FWIW, this also applies to fsverity.) If > someone is interested in taking on an experimental project add xattr encryption > support, I'd be glad to try to provide guidance, but it probably should be > separated out from adding fscrypt support in the first place. Does the fscrypt data unit size have to match i_blocksize? Or the fsverity merkle tree block size? > > > > > And if we copy the ext4 method of putting the merkle data after eof and > > > > > loading it into the pagecache, how much of the generic fs/verity cleanup > > > > > patches do we really need? > > > > > > > > We shouldn't need anything. A bunch of cleanup > > > > > > Should we do the read/drop_merkle_tree_block cleanup anyway? > > > > To me the block based interface seems a lot cleaner, but Eric has some > > reservations due to the added indirect call on the drop side. > > The point of the block based interface is really so that filesystems could > actually do the block-based caching. If XFS isn't going to end up using that > after all, I think we should just stay with ->read_merkle_tree_page for now. TBH I'd rather keep the existing caching mechanism we have now and not mess further with the pagecache. But that really depends on who's going to get this patchset across the finish line. > > > One of the advantages of xfs caching merkle tree blocks ourselves > > > is that we neither extend the usage of PageChecked when merkle blocksize > > > == pagesize nor become subject to the 1-million merkle block limit when > > > merkle blocksize < pagesize. There's a tripping hazard if you mount a 4k > > > merkle block filesystem on a computer with 64k pages -- now you can't > > > open 6T verity files. > > > > > > That said, it also sounds dumb to maintain a separate index for > > > pagecache pages to track a single bit. > > > > Yeah. As I mentioned earlier I think fsverify really should enforce > > a size limit. Right now it will simply run out space eventually which > > doesn't seem like a nice failure mode. > > You could enforce the limit of 1 << 23 Merkle tree blocks regardless of whether > the block size and page size are equal, if you want. This probably would need > to be specific to XFS, though, as it would be a breaking change for other > filesystems. Hmm. If xfs stores the merkle data in posteof blocks, then I guess the following constraint has to hold: roundup_64(isize, rounding) + merkle_data_size < 2^63-1. where rounding = max(file allocation unit, max folio size) The "max folio size" part creates problems because the folio size changes depending on arch and whether or not large folios are enabled, and we cannot spill merkle tree data into an mmapped folio that crosses EOF. I guess we could hack around that by setting rounding = 1TB or something, and then fail the begin_enable_verity call if that relationship doesn't hold. Then we'd store the merkle data at file position roundup_64(isize, rounding). > As for ext4 and f2fs failing with EFBIG in the middle of building the Merkle > tree due to it exceeding s_maxbytes, yes that could be detected at the beginning > of building the tree to get the error right away. This just isn't done > currently because the current way is simpler, and this case isn't really > encountered in practice so there isn't too much reason to optimize it. Yeah I don't think it matters if we run out of space or whatever during construction of the merkle data since (if we switch to using posteof space) we can just truncate at eof to get rid of the old tree, just like everyone else. --D > - Eric >
On Fri, May 31, 2024 at 02:28:50PM -0700, Darrick J. Wong wrote: > On Wed, May 22, 2024 at 11:29:00AM -0700, Eric Biggers wrote: > > On Wed, May 22, 2024 at 07:37:23AM -0700, Christoph Hellwig wrote: > > > On Mon, May 20, 2024 at 09:02:59AM -0700, Darrick J. Wong wrote: > > > > On Mon, May 20, 2024 at 05:39:59AM -0700, Christoph Hellwig wrote: > > > > > On Fri, May 17, 2024 at 10:17:20AM -0700, Darrick J. Wong wrote: > > > > > > > Note that the verity metadata *must* be encrypted when the file is, > > > > > > > since it contains hashes of the plaintext data. > > > > > > > > > > > > Refresh my memory of fscrypt -- does it encrypt directory names, xattr > > > > > > names, and xattr values too? Or does it only do that to file data? > > > > > > > > > > It does encrypt the file names in the directories, but nothing in > > > > > xattrs as far as I can tell. > > > > > > > > Do we want that for user.* attrs? That seems like quite an omission. > > > > > > I'll let Eric answer that. Btw, is the threat model for fscrypt written > > > down somewhere? > > > > See https://www.kernel.org/doc/html/latest/filesystems/fscrypt.html?highlight=fscrypt#threat-model > > > > As for why it stopped at file contents and names (and fsverity Merkle tree > > blocks which ext4 and f2fs encrypt in the same way as contents), it's just > > because it's very difficult to add more, and file contents and names alone were > > enough for parity with most other file-level encryption systems. That's just > > the nature of file-level encryption. For each additional type of data or > > metadata that's encrypted, there are a huge number of things that need to be > > resolved including algorithm selection, key derivation, IV selection, on-disk > > format, padding, UAPI for enabling the feature, userspace tool support including > > fsck and debugging tools, access semantics without the key, etc... > > > > xattr encryption is definitely something that people have thought about, and it > > probably would be the next thing to consider after the existing contents and > > names. Encrypting the exact file sizes is also something to consider. But it's > > not something that someone has volunteered to do all the work for (yet). If you > > restricted it to the contents of user xattrs only (not other xattrs, and not > > xattr names), it would be more feasible than trying to encrypt the names and > > values of all xattrs, though it would still be difficult. > > > > Of course, generally speaking, when adding fscrypt support to a filesystem, it's > > going to be much easier to just target the existing feature set, and not try to > > include new, unproven features too. (FWIW, this also applies to fsverity.) If > > someone is interested in taking on an experimental project add xattr encryption > > support, I'd be glad to try to provide guidance, but it probably should be > > separated out from adding fscrypt support in the first place. > > Does the fscrypt data unit size have to match i_blocksize? Or the > fsverity merkle tree block size? The crypto data unit size is a power of 2 less than or equal to i_blocksize (usually equal to it, as that is the default). See https://www.kernel.org/doc/html/latest/filesystems/fscrypt.html#contents-encryption It applies to contents encryption only, though as you know, ext4 and f2fs treat the Merkle tree in mostly the same way as contents, including for encryption. For an implementation of fsverity that doesn't treat the Merkle tree as contents like this, and where the filesystem supports fscrypt too, encryption of the Merkle tree would need to be implemented a bit differently. It wouldn't need to use the same crypto data unit size as the contents, e.g. it could be always one data unit per Merkle tree block. - Eric
diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c index 953a82d70223e..d21a743f90ea7 100644 --- a/fs/xfs/libxfs/xfs_attr.c +++ b/fs/xfs/libxfs/xfs_attr.c @@ -462,6 +462,8 @@ xfs_attr_hashval( if (attr_flags & XFS_ATTR_PARENT) return xfs_parent_hashattr(mp, name, namelen, value, valuelen); + if (attr_flags & XFS_ATTR_VERITY) + return xfs_verity_hashname(name, namelen); return xfs_attr_hashname(name, namelen); } diff --git a/fs/xfs/libxfs/xfs_da_format.h b/fs/xfs/libxfs/xfs_da_format.h index 43e9d1f00a4ab..c95e8ca22daad 100644 --- a/fs/xfs/libxfs/xfs_da_format.h +++ b/fs/xfs/libxfs/xfs_da_format.h @@ -943,4 +943,10 @@ struct xfs_merkle_key { #define XFS_VERITY_DESCRIPTOR_NAME "vdesc" #define XFS_VERITY_DESCRIPTOR_NAME_LEN (sizeof(XFS_VERITY_DESCRIPTOR_NAME) - 1) +/* + * Merkle tree blocks cannot be smaller than 1k in size, so the hash function + * can right-shift the merkle offset by this amount without losing anything. + */ +#define XFS_VERITY_HASH_SHIFT (10) + #endif /* __XFS_DA_FORMAT_H__ */ diff --git a/fs/xfs/libxfs/xfs_verity.c b/fs/xfs/libxfs/xfs_verity.c index ff02c5c840b58..8c470014b915c 100644 --- a/fs/xfs/libxfs/xfs_verity.c +++ b/fs/xfs/libxfs/xfs_verity.c @@ -56,3 +56,19 @@ xfs_verity_namecheck( return true; } + +/* + * Compute name hash for a verity attribute. For merkle tree blocks, we want + * to use the merkle tree block offset as the hash value to avoid collisions + * between blocks unless the merkle tree becomes larger than 2^32 blocks. + */ +xfs_dahash_t +xfs_verity_hashname( + const uint8_t *name, + unsigned int namelen) +{ + if (namelen != sizeof(struct xfs_merkle_key)) + return xfs_attr_hashname(name, namelen); + + return xfs_merkle_key_from_disk(name, namelen) >> XFS_VERITY_HASH_SHIFT; +} diff --git a/fs/xfs/libxfs/xfs_verity.h b/fs/xfs/libxfs/xfs_verity.h index 5813665c5a01e..3d7485c511d58 100644 --- a/fs/xfs/libxfs/xfs_verity.h +++ b/fs/xfs/libxfs/xfs_verity.h @@ -9,5 +9,6 @@ void xfs_merkle_key_to_disk(struct xfs_merkle_key *key, uint64_t pos); uint64_t xfs_merkle_key_from_disk(const void *attr_name, int namelen); bool xfs_verity_namecheck(unsigned int attr_flags, const void *name, int namelen); +xfs_dahash_t xfs_verity_hashname(const uint8_t *name, unsigned int namelen); #endif /* __XFS_VERITY_H__ */