Message ID: 20221213172935.680971-1-aalbersh@redhat.com (mailing list archive)
Series: fs-verity support for XFS
On Tue, Dec 13, 2022 at 06:29:24PM +0100, Andrey Albershteyn wrote:
> Not yet implemented:
> - No pre-fetching of Merkle tree pages in the
>   read_merkle_tree_page()

This would be helpful, but not essential.

> - No marking of already verified Merkle tree pages (each read, the
>   whole tree is verified).

This is essential to have, IMO.

You *could* do what btrfs does, where it caches the Merkle tree pages
in the inode's page cache past i_size, even though btrfs stores the
Merkle tree separately from the file data on-disk.

However, I'd guess that the other XFS developers would have an
aversion to that approach, even though it would not affect the on-disk
storage. The alternatives would be to create a separate in-memory-only
inode for the cache, or to build a custom cache with its own shrinker.

- Eric
On Tue, Dec 13, 2022 at 12:50:28PM -0800, Eric Biggers wrote:
> On Tue, Dec 13, 2022 at 06:29:24PM +0100, Andrey Albershteyn wrote:
> > Not yet implemented:
> > - No pre-fetching of Merkle tree pages in the
> >   read_merkle_tree_page()
>
> This would be helpful, but not essential.
>
> > - No marking of already verified Merkle tree pages (each read, the
> >   whole tree is verified).

Ah, I wasn't aware that this was missing.

> This is essential to have, IMO.
>
> You *could* do what btrfs does, where it caches the Merkle tree pages
> in the inode's page cache past i_size, even though btrfs stores the
> Merkle tree separately from the file data on-disk.
>
> However, I'd guess that the other XFS developers would have an
> aversion to that approach, even though it would not affect the
> on-disk storage.

Yup, on an architectural level it just seems wrong to cache secure
verification metadata in the same user accessible address space as the
data it verifies.

> The alternatives would be to create a separate in-memory-only inode
> for the cache, or to build a custom cache with its own shrinker.

The Merkle tree blocks are cached in the XFS buffer cache.

Andrey, could we just add a new flag to xfs_buf->b_flags to indicate
that the buffer contains verified Merkle tree records? i.e. if the
flag is not set after we've read the buffer, we need to verify the
buffer and set the flag; if the flag is already set on a buffer found
in cache, we can skip the verification.

Cheers,

Dave.
On Wed, Dec 14, 2022 at 09:11:39AM +1100, Dave Chinner wrote:
> On Tue, Dec 13, 2022 at 12:50:28PM -0800, Eric Biggers wrote:
> > On Tue, Dec 13, 2022 at 06:29:24PM +0100, Andrey Albershteyn wrote:
> > > Not yet implemented:
> > > - No pre-fetching of Merkle tree pages in the
> > >   read_merkle_tree_page()
> >
> > This would be helpful, but not essential.
> >
> > > - No marking of already verified Merkle tree pages (each read, the
> > >   whole tree is verified).
>
> Ah, I wasn't aware that this was missing.
>
> > This is essential to have, IMO.
> >
> > You *could* do what btrfs does, where it caches the Merkle tree
> > pages in the inode's page cache past i_size, even though btrfs
> > stores the Merkle tree separately from the file data on-disk.
> >
> > However, I'd guess that the other XFS developers would have an
> > aversion to that approach, even though it would not affect the
> > on-disk storage.
>
> Yup, on an architectural level it just seems wrong to cache secure
> verification metadata in the same user accessible address space as
> the data it verifies.
>
> > The alternatives would be to create a separate in-memory-only inode
> > for the cache, or to build a custom cache with its own shrinker.
>
> The Merkle tree blocks are cached in the XFS buffer cache.
>
> Andrey, could we just add a new flag to xfs_buf->b_flags to indicate
> that the buffer contains verified Merkle tree records? i.e. if the
> flag is not set after we've read the buffer, we need to verify the
> buffer and set the flag; if the flag is already set on a buffer found
> in cache, we can skip the verification.

Well, my proposal at
https://lore.kernel.org/r/20221028224539.171818-2-ebiggers@kernel.org
is to keep tracking the "verified" status at the individual Merkle
tree block level, by adding a bitmap fsverity_info::hash_block_verified.
That is part of the fs/verity/ infrastructure, and all filesystems
would be able to use it.

However, since it's necessary to re-verify blocks that have been
evicted and then re-instantiated, my patch also repurposes PG_checked
as an indicator for whether the Merkle tree pages are newly
instantiated. For a "non-page-cache cache", that part would need to be
replaced with something equivalent.

A different approach would be to make it so that every time a page (or
"cache buffer", to call it something more generic) of N Merkle tree
blocks is read, then all N of those blocks are verified immediately.
Then there would be no need to track the "verified" status of
individual blocks. My concerns with that approach are:

* Most data reads only need a single Merkle tree block at the deepest
  level. If at least N tree blocks were verified any time that any
  were verified at all, that would make the worst-case read latency
  worse.

* It's possible that the parents of N tree blocks are split across a
  cache buffer. Thus, while N blocks can't have more than N parents,
  and in practice would just have 1-2, those 2 parents could be split
  into two separate cache buffers, with a total length of 2*N.
  Verifying all of those would really increase the worst-case latency
  as well.

So I'm thinking that tracking the "verified" status of tree blocks
individually is the right way to go. But I'd appreciate any other
thoughts on this.

- Eric
On Tue, Dec 13, 2022 at 10:31:42PM -0800, Eric Biggers wrote:
> On Wed, Dec 14, 2022 at 09:11:39AM +1100, Dave Chinner wrote:
> > On Tue, Dec 13, 2022 at 12:50:28PM -0800, Eric Biggers wrote:
> > > On Tue, Dec 13, 2022 at 06:29:24PM +0100, Andrey Albershteyn wrote:
> > > > Not yet implemented:
> > > > - No pre-fetching of Merkle tree pages in the
> > > >   read_merkle_tree_page()
> > >
> > > This would be helpful, but not essential.
> > >
> > > > - No marking of already verified Merkle tree pages (each read,
> > > >   the whole tree is verified).
> >
> > Ah, I wasn't aware that this was missing.
> >
> > > This is essential to have, IMO.
> > >
> > > You *could* do what btrfs does, where it caches the Merkle tree
> > > pages in the inode's page cache past i_size, even though btrfs
> > > stores the Merkle tree separately from the file data on-disk.
> > >
> > > However, I'd guess that the other XFS developers would have an
> > > aversion to that approach, even though it would not affect the
> > > on-disk storage.
> >
> > Yup, on an architectural level it just seems wrong to cache secure
> > verification metadata in the same user accessible address space as
> > the data it verifies.
> >
> > > The alternatives would be to create a separate in-memory-only
> > > inode for the cache, or to build a custom cache with its own
> > > shrinker.
> >
> > The Merkle tree blocks are cached in the XFS buffer cache.
> >
> > Andrey, could we just add a new flag to xfs_buf->b_flags to
> > indicate that the buffer contains verified Merkle tree records?
> > i.e. if the flag is not set after we've read the buffer, we need to
> > verify the buffer and set the flag; if the flag is already set on a
> > buffer found in cache, we can skip the verification.
>
> Well, my proposal at
> https://lore.kernel.org/r/20221028224539.171818-2-ebiggers@kernel.org
> is to keep tracking the "verified" status at the individual Merkle
> tree block level, by adding a bitmap
> fsverity_info::hash_block_verified. That is part of the fs/verity/
> infrastructure, and all filesystems would be able to use it.

Yeah, I had a look at that rewrite of the verification code last night
- I get the gist of what it is doing, but a single patch of that
complexity is largely impossible to sanely review...

Correct me if I'm wrong, but won't using a bitmap with 1 bit per
verified block cause problems with contiguous memory allocation pretty
quickly? i.e. a 64kB bitmap only tracks 512k blocks, which is only 2GB
of Merkle tree data. Hence at file sizes of 100+GB, the bitmap would
have to be kvmalloc()d to guarantee allocation will succeed.

I'm not really worried about the bitmap memory usage, just that it
handles large contiguous allocations sanely. I suspect we may
eventually need a sparse bitmap (e.g. the old btrfs bit-radix
implementation) to track verification in really large files
efficiently.

> However, since it's necessary to re-verify blocks that have been
> evicted and then re-instantiated, my patch also repurposes PG_checked
> as an indicator for whether the Merkle tree pages are newly
> instantiated. For a "non-page-cache cache", that part would need to
> be replaced with something equivalent.

Which we could get as a boolean state from the XFS buffer cache fairly
easily - did we find the buffer in cache, or was it read from disk...

> A different approach would be to make it so that every time a page
> (or "cache buffer", to call it something more generic) of N Merkle
> tree blocks is read, then all N of those blocks are verified
> immediately. Then there would be no need to track the "verified"
> status of individual blocks.

That won't work with XFS - Merkle tree blocks are not contiguous in
the attribute b-tree, so there is no efficient "sequential bulk read"
option available. The xattr structure was largely chosen because it
allows for fast, deterministic single Merkle tree block operations...

> My concerns with that approach are:
>
> * Most data reads only need a single Merkle tree block at the deepest
>   level.

Yup, see above. :)

>   If at least N tree blocks were verified any time that any were
>   verified at all, that would make the worst-case read latency worse.

*nod*

> * It's possible that the parents of N tree blocks are split across a
>   cache buffer. Thus, while N blocks can't have more than N parents,
>   and in practice would just have 1-2, those 2 parents could be split
>   into two separate cache buffers, with a total length of 2*N.
>   Verifying all of those would really increase the worst-case latency
>   as well.
>
> So I'm thinking that tracking the "verified" status of tree blocks
> individually is the right way to go. But I'd appreciate any other
> thoughts on this.

I think having the fsverity code track verified indexes itself is much
more flexible and self-contained, and the right way to go about it.

The other issue is that verify_page() assumes that it can drop the
reference to the cached object itself - the caller actually owns the
reference to the object, not the verify_page() code. Hence if we are
passing opaque buffers to verify_page() rather than page cache pages,
we need a ->drop_block method that gets called instead of put_page().
This will allow the filesystem to hold a reference to the Merkle tree
block data while the verification occurs, ensuring that it doesn't get
reclaimed by memory pressure whilst still in use...

Cheers,

Dave.
On Thu, Dec 15, 2022 at 10:06:32AM +1100, Dave Chinner wrote:
> > Well, my proposal at
> > https://lore.kernel.org/r/20221028224539.171818-2-ebiggers@kernel.org
> > is to keep tracking the "verified" status at the individual Merkle
> > tree block level, by adding a bitmap
> > fsverity_info::hash_block_verified. That is part of the fs/verity/
> > infrastructure, and all filesystems would be able to use it.
>
> Yeah, I had a look at that rewrite of the verification code last
> night - I get the gist of what it is doing, but a single patch of
> that complexity is largely impossible to sanely review...

Thanks for taking a look at it. It doesn't really lend itself to being
split up, unfortunately, but I'll see what I can do.

> Correct me if I'm wrong, but won't using a bitmap with 1 bit per
> verified block cause problems with contiguous memory allocation
> pretty quickly? i.e. a 64kB bitmap only tracks 512k blocks, which is
> only 2GB of Merkle tree data. Hence at file sizes of 100+GB, the
> bitmap would have to be kvmalloc()d to guarantee allocation will
> succeed.
>
> I'm not really worried about the bitmap memory usage, just that it
> handles large contiguous allocations sanely. I suspect we may
> eventually need a sparse bitmap (e.g. the old btrfs bit-radix
> implementation) to track verification in really large files
> efficiently.

Well, that's why my patch uses kvmalloc() to allocate the bitmap.

I did originally think it was going to have to be a sparse bitmap that
ties into the shrinker so that pages of it can be evicted. But if you
do the math, the required bitmap size is only 1 / 2^22 the size of the
file, assuming the Merkle tree uses SHA-256 and 4K blocks. So a 100MB
file only needs a 24-byte bitmap, and the bitmap for any file under
17GB fits in a 4K page.

My patch puts an arbitrary limit at a 1 MiB bitmap, which would be a
4.4TB file.

It's not ideal to say "4TB Ought To Be Enough For Anybody". But it
does feel that it's not currently worth the extra complexity and
runtime overhead of implementing a full-blown sparse bitmap with cache
eviction support, when no one currently has a use case for fsverity on
files anywhere near that large.

> The other issue is that verify_page() assumes that it can drop the
> reference to the cached object itself - the caller actually owns the
> reference to the object, not the verify_page() code. Hence if we are
> passing opaque buffers to verify_page() rather than page cache pages,
> we need a ->drop_block method that gets called instead of put_page().
> This will allow the filesystem to hold a reference to the Merkle tree
> block data while the verification occurs, ensuring that it doesn't
> get reclaimed by memory pressure whilst still in use...

Yes, probably the prototype of ->read_merkle_tree_page will need to
change too.

- Eric
On Wed, Dec 14, 2022 at 10:47:37PM -0800, Eric Biggers wrote:
> On Thu, Dec 15, 2022 at 10:06:32AM +1100, Dave Chinner wrote:
> > > Well, my proposal at
> > > https://lore.kernel.org/r/20221028224539.171818-2-ebiggers@kernel.org
> > > is to keep tracking the "verified" status at the individual
> > > Merkle tree block level, by adding a bitmap
> > > fsverity_info::hash_block_verified. That is part of the
> > > fs/verity/ infrastructure, and all filesystems would be able to
> > > use it.
> >
> > Yeah, I had a look at that rewrite of the verification code last
> > night - I get the gist of what it is doing, but a single patch of
> > that complexity is largely impossible to sanely review...
>
> Thanks for taking a look at it. It doesn't really lend itself to
> being split up, unfortunately, but I'll see what I can do.
>
> > Correct me if I'm wrong, but won't using a bitmap with 1 bit per
> > verified block cause problems with contiguous memory allocation
> > pretty quickly? i.e. a 64kB bitmap only tracks 512k blocks, which
> > is only 2GB of Merkle tree data. Hence at file sizes of 100+GB,
> > the bitmap would have to be kvmalloc()d to guarantee allocation
> > will succeed.
> >
> > I'm not really worried about the bitmap memory usage, just that it
> > handles large contiguous allocations sanely. I suspect we may
> > eventually need a sparse bitmap (e.g. the old btrfs bit-radix
> > implementation) to track verification in really large files
> > efficiently.
>
> Well, that's why my patch uses kvmalloc() to allocate the bitmap.
>
> I did originally think it was going to have to be a sparse bitmap
> that ties into the shrinker so that pages of it can be evicted. But
> if you do the math, the required bitmap size is only 1 / 2^22 the
> size of the file, assuming the Merkle tree uses SHA-256 and 4K
> blocks. So a 100MB file only needs a 24-byte bitmap, and the bitmap
> for any file under 17GB fits in a 4K page.
>
> My patch puts an arbitrary limit at a 1 MiB bitmap, which would be a
> 4.4TB file.
>
> It's not ideal to say "4TB Ought To Be Enough For Anybody". But it
> does feel that it's not currently worth the extra complexity and
> runtime overhead of implementing a full-blown sparse bitmap with
> cache eviction support, when no one currently has a use case for
> fsverity on files anywhere near that large.

I think we can live with that for the moment, but I suspect that 4TB
file size limit will become an issue sooner rather than later. What
will happen if someone tries to measure a file larger than this limit?
What's the failure mode?

Cheers,

Dave.
On Fri, Dec 16, 2022 at 07:57:37AM +1100, Dave Chinner wrote:
> > Well, that's why my patch uses kvmalloc() to allocate the bitmap.
> >
> > I did originally think it was going to have to be a sparse bitmap
> > that ties into the shrinker so that pages of it can be evicted.
> > But if you do the math, the required bitmap size is only 1 / 2^22
> > the size of the file, assuming the Merkle tree uses SHA-256 and 4K
> > blocks. So a 100MB file only needs a 24-byte bitmap, and the
> > bitmap for any file under 17GB fits in a 4K page.
> >
> > My patch puts an arbitrary limit at a 1 MiB bitmap, which would be
> > a 4.4TB file.
> >
> > It's not ideal to say "4TB Ought To Be Enough For Anybody". But it
> > does feel that it's not currently worth the extra complexity and
> > runtime overhead of implementing a full-blown sparse bitmap with
> > cache eviction support, when no one currently has a use case for
> > fsverity on files anywhere near that large.
>
> I think we can live with that for the moment, but I suspect that 4TB
> file size limit will become an issue sooner rather than later. What
> will happen if someone tries to measure a file larger than this
> limit? What's the failure mode?

FS_IOC_ENABLE_VERITY will fail with EFBIG.

- Eric