diff mbox series

[03/13] fsverity: support block-based Merkle tree caching

Message ID 171175867913.1987804.4275409634176359386.stgit@frogsfrogsfrogs (mailing list archive)
State Superseded
Headers show
Series [01/13] fs: add FS_XFLAG_VERITY for verity files | expand

Commit Message

Darrick J. Wong March 30, 2024, 12:33 a.m. UTC
From: Andrey Albershteyn <aalbersh@redhat.com>

In the current implementation fs-verity expects filesystem to provide
PAGEs filled with Merkle tree blocks. Then, when fs-verity is done with
processing the blocks, reference to PAGE is freed. This doesn't fit well
with the way XFS manages its memory.

To allow XFS integrate fs-verity this patch adds ability to fs-verity
verification code to take Merkle tree blocks instead of PAGE reference.
This way ext4, f2fs, and btrfs are still able to pass PAGE references
and XFS can pass reference to Merkle tree blocks stored in XFS's
extended attribute infrastructure.

To achieve this, the XFS implementation will implement its own incore
merkle tree block cache.  These blocks will be passed to fsverity when
it needs to read a merkle tree block, and dropped  by fsverity when
validation completes.  The cache will keep track of whether or not a
given block has already been verified, which will improve performance on
random reads.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
[djwong: fix uninit err variable, remove dependency on bitmap, apply
 various suggestions from maintainer, tighten changelog]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/verity/fsverity_private.h |   37 ++++++++++
 fs/verity/open.c             |    7 ++
 fs/verity/read_metadata.c    |   63 ++++++++---------
 fs/verity/verify.c           |  152 ++++++++++++++++++++++++++++++------------
 include/linux/fsverity.h     |   76 +++++++++++++++++++++
 5 files changed, 255 insertions(+), 80 deletions(-)

Comments

Eric Biggers April 5, 2024, 2:31 a.m. UTC | #1
On Fri, Mar 29, 2024 at 05:33:27PM -0700, Darrick J. Wong wrote:
> diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
> index b3506f56e180b..c9d97c2bebd84 100644
> --- a/fs/verity/fsverity_private.h
> +++ b/fs/verity/fsverity_private.h
> @@ -154,4 +154,41 @@ static inline void fsverity_init_signature(void)
>  
>  void __init fsverity_init_workqueue(void);
>  
> +static inline bool fsverity_caches_blocks(const struct inode *inode)
> +{
> +	const struct fsverity_operations *vops = inode->i_sb->s_vop;
> +
> +	WARN_ON_ONCE(vops->read_merkle_tree_block &&
> +		     !vops->drop_merkle_tree_block);
> +
> +	return vops->read_merkle_tree_block != NULL;
> +}
> +
> +static inline bool fsverity_uses_bitmap(const struct fsverity_info *vi,
> +					const struct inode *inode)
> +{
> +	if (fsverity_caches_blocks(inode))
> +		return false;
> +
> +	/*
> +	 * If fs uses block-based Merkle tree caching, then fs-verity must use
> +	 * hash_block_verified bitmap as there's no page to mark it with
> +	 * PG_checked.
> +	 */
> +	return vi->tree_params.block_size != PAGE_SIZE;
> +}

For checking whether the bitmap is in use, it's simpler and more efficient to
just directly check whether vi->hash_block_verified is NULL, as the existing
code does.  Only the code that allocates the bitmap actually needs to be aware
of the details of when the bitmap gets enabled.

fsverity_caches_blocks() has a similar issue, where it could just be replaced
with checking vops->read_merkle_tree_block != NULL directly (or equivalently
vops->drop_merkle_tree_block, which works well in
fsverity_drop_merkle_tree_block() since that's the function pointer it's
calling).  The WARN_ON_ONCE() should be done in fsverity_create_info(), not
inlined multiple times into the verification code.

(I think the name "fsverity_caches_blocks()" is also confusing because the
fsverity support layer does not cache blocks, but all the filesystems do.  What
it's actually trying to convey is a difference in how the filesystem caches the
blocks, which I don't think it does a good job at.)

> +int fsverity_read_merkle_tree_block(struct inode *inode,
> +				    const struct merkle_tree_params *params,
> +				    u64 pos, unsigned long ra_bytes,
> +				    struct fsverity_blockbuf *block);
> +
> +/*
> + * Drop 'block' obtained with ->read_merkle_tree_block(). Calls out back to
> + * filesystem if ->drop_merkle_tree_block() is set, otherwise, drop the
> + * reference in the block->context.

"drop the reference" => "drops the reference"

> +	/*
> +	 * If fs uses block-based Merkle tree cachin, then fs-verity must use
> +	 * hash_block_verified bitmap as there's no page to mark it with
> +	 * PG_checked.
> +	 */
> +	if (fsverity_uses_bitmap(vi, inode)) {

The comment contradicts the code.

> diff --git a/fs/verity/read_metadata.c b/fs/verity/read_metadata.c
> index f58432772d9ea..94fffa060f829 100644
> --- a/fs/verity/read_metadata.c
> +++ b/fs/verity/read_metadata.c
> @@ -14,65 +14,60 @@
>  
>  static int fsverity_read_merkle_tree(struct inode *inode,
>  				     const struct fsverity_info *vi,
> -				     void __user *buf, u64 offset, int length)
> +				     void __user *buf, u64 pos, int length)
>  {
> -	const struct fsverity_operations *vops = inode->i_sb->s_vop;
> -	u64 end_offset;
> -	unsigned int offs_in_page;
> -	pgoff_t index, last_index;
> +	const u64 end_pos = min(pos + length, vi->tree_params.tree_size);
> +	const struct merkle_tree_params *params = &vi->tree_params;
> +	unsigned int offs_in_block = pos & (params->block_size - 1);
>  	int retval = 0;
>  	int err = 0;
>  
> -	end_offset = min(offset + length, vi->tree_params.tree_size);
> -	if (offset >= end_offset)
> +	if (pos >= end_pos)
>  		return 0;

The above 'pos >= end_pos' check is no longer necessary.

>  	/*
> -	 * Iterate through each Merkle tree page in the requested range and copy
> -	 * the requested portion to userspace.  Note that the Merkle tree block
> -	 * size isn't important here, as we are returning a byte stream; i.e.,
> -	 * we can just work with pages even if the tree block size != PAGE_SIZE.
> +	 * Iterate through each Merkle tree block in the requested range and
> +	 * copy the requested portion to userspace. Note that we are returning
> +	 * a byte stream.
>  	 */
> -	for (index = offset >> PAGE_SHIFT; index <= last_index; index++) {
> -		unsigned long num_ra_pages =
> -			min_t(unsigned long, last_index - index + 1,
> -			      inode->i_sb->s_bdi->io_pages);
> -		unsigned int bytes_to_copy = min_t(u64, end_offset - offset,
> -						   PAGE_SIZE - offs_in_page);
> -		struct page *page;
> -		const void *virt;
> +	while (pos < end_pos) {
> +		unsigned long ra_bytes;
> +		unsigned int bytes_to_copy;
> +		struct fsverity_blockbuf block = {
> +			.size = params->block_size,
> +		};
>  
> -		page = vops->read_merkle_tree_page(inode, index, num_ra_pages);
> -		if (IS_ERR(page)) {
> -			err = PTR_ERR(page);
> +		ra_bytes = min_t(unsigned long, end_pos - pos + 1,
> +				 inode->i_sb->s_bdi->io_pages << PAGE_SHIFT);

This introduces an off-by-one error in the calculation of ra_bytes.  end_pos
is exclusive, but the calculation of ra_bytes assumes it is inclusive.

Also, might io_pages << PAGE_SHIFT overflow an unsigned long?

> +		bytes_to_copy = min_t(u64, end_pos - pos,
> +				      params->block_size - offs_in_block);
> +
> +		err = fsverity_read_merkle_tree_block(inode, &vi->tree_params,
> +				pos - offs_in_block, ra_bytes, &block);
> +		if (err) {
>  			fsverity_err(inode,
> -				     "Error %d reading Merkle tree page %lu",
> -				     err, index);
> +				     "Error %d reading Merkle tree block %llu",
> +				     err, pos);
>  			break;

The error message should go into fsverity_read_merkle_tree_block() so that it
does not need to be duplicated in its two callers.  This would, additionally,
eliminate the need to introduce the 'err' variable in verify_data_block().

> diff --git a/fs/verity/verify.c b/fs/verity/verify.c
> index 4fcad0825a120..0b5e11073e883 100644
> --- a/fs/verity/verify.c
> +++ b/fs/verity/verify.c
> @@ -13,14 +13,27 @@
>  static struct workqueue_struct *fsverity_read_workqueue;
>  
>  /*
> - * Returns true if the hash block with index @hblock_idx in the tree, located in
> - * @hpage, has already been verified.
> + * Returns true if the hash block with index @hblock_idx in the tree has
> + * already been verified.
>   */
> -static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
> +static bool is_hash_block_verified(struct inode *inode,
> +				   struct fsverity_blockbuf *block,
>  				   unsigned long hblock_idx)

The comment should be updated to mention @block.

> @@ -143,11 +156,9 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
>  	for (level = 0; level < params->num_levels; level++) {
>  		unsigned long next_hidx;
>  		unsigned long hblock_idx;
> -		pgoff_t hpage_idx;
> -		unsigned int hblock_offset_in_page;
> +		u64 hblock_pos;
>  		unsigned int hoffset;
> -		struct page *hpage;
> -		const void *haddr;
> +		struct fsverity_blockbuf *block = &hblocks[level].block;
>  
>  		/*
>  		 * The index of the block in the current level; also the index
> @@ -158,36 +169,34 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
>  		/* Index of the hash block in the tree overall */
>  		hblock_idx = params->level_start[level] + next_hidx;
>  
> -		/* Index of the hash page in the tree overall */
> -		hpage_idx = hblock_idx >> params->log_blocks_per_page;
> -
> -		/* Byte offset of the hash block within the page */
> -		hblock_offset_in_page =
> -			(hblock_idx << params->log_blocksize) & ~PAGE_MASK;
> +		/* Byte offset of the hash block in the tree overall */
> +		hblock_pos = hblock_idx << params->log_blocksize;

'hblock_idx << params->log_blocksize' may overflow an unsigned long, so
'hblock_idx' needs to be cast to u64 before doing the shift.

> +		if (level == 0)
> +			ra_bytes = min(max_ra_bytes,
> +				       params->tree_size - hblock_pos);
> +		else
> +			ra_bytes = 0;

The first argument to min() has type unsigned long, and the second has type u64.
Doesn't this generate a warning on 32-bit systems?

> @@ -325,7 +333,7 @@ void fsverity_verify_bio(struct bio *bio)
>  
>  	bio_for_each_folio_all(fi, bio) {
>  		if (!verify_data_blocks(fi.folio, fi.length, fi.offset,
> -					max_ra_pages)) {
> +					max_ra_pages << PAGE_SHIFT)) {
>  			bio->bi_status = BLK_STS_IOERR;
>  			break;

This function should calculate max_ra_bytes as bio->bi_iter.bi_size >> 2.  It's
not necessary to convert to pages, then back to bytes again.

> +/**
> + * fsverity_read_merkle_tree_block() - read Merkle tree block
> + * @inode: inode to which this Merkle tree blocks belong

"this Merkle tree blocks belong" => "this Merkle tree block belongs"

> + * @params: merkle tree parameters
> + * @pos: byte position within merkle tree
> + * @ra_bytes: try to read ahead this many btes

"btes" => "bytes"

> +int fsverity_read_merkle_tree_block(struct inode *inode,
> +				    const struct merkle_tree_params *params,
> +				    u64 pos, unsigned long ra_bytes,
> +				    struct fsverity_blockbuf *block)
> +{
> +	const struct fsverity_operations *vops = inode->i_sb->s_vop;
> +	unsigned long page_idx;
> +	struct page *page;
> +	unsigned long index;
> +	unsigned int offset_in_page;
> +
> +	if (fsverity_caches_blocks(inode)) {
> +		block->verified = false;
> +		return vops->read_merkle_tree_block(inode, pos, ra_bytes,
> +				params->log_blocksize, block);
> +	}
> +
> +	index = pos >> params->log_blocksize;

Should the fourth parameter to ->read_merkle_tree_block be the block index
(which is computed above) instead of log_blocksize?  XFS only uses
params->log_blocksize to compute the block index anyway.

> +/**
> + * struct fsverity_blockbuf - Merkle Tree block buffer
> + * @kaddr: virtual address of the block's data
> + * @offset: block's offset into Merkle tree
> + * @size: the Merkle tree block size
> + * @context: filesystem private context
> + * @verified: has this buffer been validated?
> + *
> + * Buffer containing single Merkle Tree block. These buffers are passed
> + *  - to filesystem, when fs-verity is building merkel tree,
> + *  - from filesystem, when fs-verity is reading merkle tree from a disk.
> + * Filesystems sets kaddr together with size to point to a memory which contains
> + * Merkle tree block. Same is done by fs-verity when Merkle tree is need to be
> + * written down to disk.

This comment still incorrectly claims that fsverity_blockbuf is being used for
writes.

- Eric
Darrick J. Wong April 24, 2024, 9:25 p.m. UTC | #2
On Thu, Apr 04, 2024 at 10:31:45PM -0400, Eric Biggers wrote:
> On Fri, Mar 29, 2024 at 05:33:27PM -0700, Darrick J. Wong wrote:
> > diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
> > index b3506f56e180b..c9d97c2bebd84 100644
> > --- a/fs/verity/fsverity_private.h
> > +++ b/fs/verity/fsverity_private.h
> > @@ -154,4 +154,41 @@ static inline void fsverity_init_signature(void)
> >  
> >  void __init fsverity_init_workqueue(void);
> >  
> > +static inline bool fsverity_caches_blocks(const struct inode *inode)
> > +{
> > +	const struct fsverity_operations *vops = inode->i_sb->s_vop;
> > +
> > +	WARN_ON_ONCE(vops->read_merkle_tree_block &&
> > +		     !vops->drop_merkle_tree_block);
> > +
> > +	return vops->read_merkle_tree_block != NULL;
> > +}
> > +
> > +static inline bool fsverity_uses_bitmap(const struct fsverity_info *vi,
> > +					const struct inode *inode)
> > +{
> > +	if (fsverity_caches_blocks(inode))
> > +		return false;
> > +
> > +	/*
> > +	 * If fs uses block-based Merkle tree caching, then fs-verity must use
> > +	 * hash_block_verified bitmap as there's no page to mark it with
> > +	 * PG_checked.
> > +	 */
> > +	return vi->tree_params.block_size != PAGE_SIZE;
> > +}
> 
> For checking whether the bitmap is in use, it's simpler and more efficient to
> just directly check whether vi->hash_block_verified is NULL, as the existing
> code does.  Only the code that allocates the bitmap actually needs to be aware
> of the details of when the bitmap gets enabled.
> 
> fsverity_caches_blocks() has a similar issue, where it could just be replaced
> with checking vops->read_merkle_tree_block != NULL directly (or equivalently
> vops->drop_merkle_tree_block, which works well in
> fsverity_drop_merkle_tree_block() since that's the function pointer it's
> calling).  The WARN_ON_ONCE() should be done in fsverity_create_info(), not
> inlined multiple times into the verification code.

Ok, I'll move the WARN_ON_ONCE there:

	/*
	 * If the filesystem implementation supplies Merkle tree content on a
	 * per-block basis, it must implement both the read and drop functions.
	 * If it supplies content on a per-page basis, neither should be
	 * provided.
	 */
	if (vops->read_merkle_tree_block)
		WARN_ON_ONCE(vops->drop_merkle_tree_block == NULL);
	else
		WARN_ON_ONCE(vops->drop_merkle_tree_block != NULL);


> (I think the name "fsverity_caches_blocks()" is also confusing because the
> fsverity support layer does not cache blocks, but all the filesystems do.  What
> it's actually trying to convey is a difference in how the filesystem caches the
> blocks, which I don't think it does a good job at.)

Ok, I'll drop both and open-code them with nicer comments.

> > +int fsverity_read_merkle_tree_block(struct inode *inode,
> > +				    const struct merkle_tree_params *params,
> > +				    u64 pos, unsigned long ra_bytes,
> > +				    struct fsverity_blockbuf *block);
> > +
> > +/*
> > + * Drop 'block' obtained with ->read_merkle_tree_block(). Calls out back to
> > + * filesystem if ->drop_merkle_tree_block() is set, otherwise, drop the
> > + * reference in the block->context.
> 
> "drop the reference" => "drops the reference"

Fixed.

> > +	/*
> > +	 * If fs uses block-based Merkle tree cachin, then fs-verity must use
> > +	 * hash_block_verified bitmap as there's no page to mark it with
> > +	 * PG_checked.
> > +	 */
> > +	if (fsverity_uses_bitmap(vi, inode)) {
> 
> The comment contradicts the code.

Urrrk, yes, it does. Sorry about that.

	/*
	 * If the fs supplies Merkle tree blocks on a per-page basis and the
	 * page size doesn't match the block size, fs-verity must use the
	 * hash_block_verified bitmap instead of PG_checked.
	 */
	if (vops->read_merkle_tree_block == NULL &&
	    vi->tree_params.block_size != PAGE_SIZE) {

		...allocate bitmap...
	}

> 
> > diff --git a/fs/verity/read_metadata.c b/fs/verity/read_metadata.c
> > index f58432772d9ea..94fffa060f829 100644
> > --- a/fs/verity/read_metadata.c
> > +++ b/fs/verity/read_metadata.c
> > @@ -14,65 +14,60 @@
> >  
> >  static int fsverity_read_merkle_tree(struct inode *inode,
> >  				     const struct fsverity_info *vi,
> > -				     void __user *buf, u64 offset, int length)
> > +				     void __user *buf, u64 pos, int length)
> >  {
> > -	const struct fsverity_operations *vops = inode->i_sb->s_vop;
> > -	u64 end_offset;
> > -	unsigned int offs_in_page;
> > -	pgoff_t index, last_index;
> > +	const u64 end_pos = min(pos + length, vi->tree_params.tree_size);
> > +	const struct merkle_tree_params *params = &vi->tree_params;
> > +	unsigned int offs_in_block = pos & (params->block_size - 1);
> >  	int retval = 0;
> >  	int err = 0;
> >  
> > -	end_offset = min(offset + length, vi->tree_params.tree_size);
> > -	if (offset >= end_offset)
> > +	if (pos >= end_pos)
> >  		return 0;
> 
> The above 'pos >= end_pos' check is no longer necessary.

Aha, good catch.

> >  	/*
> > -	 * Iterate through each Merkle tree page in the requested range and copy
> > -	 * the requested portion to userspace.  Note that the Merkle tree block
> > -	 * size isn't important here, as we are returning a byte stream; i.e.,
> > -	 * we can just work with pages even if the tree block size != PAGE_SIZE.
> > +	 * Iterate through each Merkle tree block in the requested range and
> > +	 * copy the requested portion to userspace. Note that we are returning
> > +	 * a byte stream.
> >  	 */
> > -	for (index = offset >> PAGE_SHIFT; index <= last_index; index++) {
> > -		unsigned long num_ra_pages =
> > -			min_t(unsigned long, last_index - index + 1,
> > -			      inode->i_sb->s_bdi->io_pages);
> > -		unsigned int bytes_to_copy = min_t(u64, end_offset - offset,
> > -						   PAGE_SIZE - offs_in_page);
> > -		struct page *page;
> > -		const void *virt;
> > +	while (pos < end_pos) {
> > +		unsigned long ra_bytes;
> > +		unsigned int bytes_to_copy;
> > +		struct fsverity_blockbuf block = {
> > +			.size = params->block_size,
> > +		};
> >  
> > -		page = vops->read_merkle_tree_page(inode, index, num_ra_pages);
> > -		if (IS_ERR(page)) {
> > -			err = PTR_ERR(page);
> > +		ra_bytes = min_t(unsigned long, end_pos - pos + 1,
> > +				 inode->i_sb->s_bdi->io_pages << PAGE_SHIFT);
> 
> This introduces an off-by-one error in the calculation of ra_bytes.  end_pos
> is exclusive, but the calculation of ra_bytes assumes it is inclusive.
> 
> Also, might io_pages << PAGE_SHIFT overflow an unsigned long?

I hope not -- max io and readahead should never be 4GB.  I'm a little
surprised this isn't s_bdi->ra_pages.

	struct backing_dev_info *bdi = inode->i_sb->s_bdi;
	const u64 max_ra_bytes = min((u64)bdi->io_pages << PAGE_SHIFT,
				     ULONG_MAX);

	while (pos < end_pos) {
		ra_bytes = min_t(unsigned long, end_pos - pos, max_ra_bytes);

> > +		bytes_to_copy = min_t(u64, end_pos - pos,
> > +				      params->block_size - offs_in_block);
> > +
> > +		err = fsverity_read_merkle_tree_block(inode, &vi->tree_params,
> > +				pos - offs_in_block, ra_bytes, &block);
> > +		if (err) {
> >  			fsverity_err(inode,
> > -				     "Error %d reading Merkle tree page %lu",
> > -				     err, index);
> > +				     "Error %d reading Merkle tree block %llu",
> > +				     err, pos);
> >  			break;
> 
> The error message should go into fsverity_read_merkle_tree_block() so that it
> does not need to be duplicated in its two callers.  This would, additionally,
> eliminate the need to introduce the 'err' variable in verify_data_block().

Do you want that to be a separate cleanup patch?

> > diff --git a/fs/verity/verify.c b/fs/verity/verify.c
> > index 4fcad0825a120..0b5e11073e883 100644
> > --- a/fs/verity/verify.c
> > +++ b/fs/verity/verify.c
> > @@ -13,14 +13,27 @@
> >  static struct workqueue_struct *fsverity_read_workqueue;
> >  
> >  /*
> > - * Returns true if the hash block with index @hblock_idx in the tree, located in
> > - * @hpage, has already been verified.
> > + * Returns true if the hash block with index @hblock_idx in the tree has
> > + * already been verified.
> >   */
> > -static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
> > +static bool is_hash_block_verified(struct inode *inode,
> > +				   struct fsverity_blockbuf *block,
> >  				   unsigned long hblock_idx)
> 
> The comment should be updated to mention @block.

I'll mention @inode too:

/*
 * Returns true if the hash @block with index @hblock_idx in the merkle tree
 * for @inode has already been verified.
 */


> > @@ -143,11 +156,9 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
> >  	for (level = 0; level < params->num_levels; level++) {
> >  		unsigned long next_hidx;
> >  		unsigned long hblock_idx;
> > -		pgoff_t hpage_idx;
> > -		unsigned int hblock_offset_in_page;
> > +		u64 hblock_pos;
> >  		unsigned int hoffset;
> > -		struct page *hpage;
> > -		const void *haddr;
> > +		struct fsverity_blockbuf *block = &hblocks[level].block;
> >  
> >  		/*
> >  		 * The index of the block in the current level; also the index
> > @@ -158,36 +169,34 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
> >  		/* Index of the hash block in the tree overall */
> >  		hblock_idx = params->level_start[level] + next_hidx;
> >  
> > -		/* Index of the hash page in the tree overall */
> > -		hpage_idx = hblock_idx >> params->log_blocks_per_page;
> > -
> > -		/* Byte offset of the hash block within the page */
> > -		hblock_offset_in_page =
> > -			(hblock_idx << params->log_blocksize) & ~PAGE_MASK;
> > +		/* Byte offset of the hash block in the tree overall */
> > +		hblock_pos = hblock_idx << params->log_blocksize;
> 
> 'hblock_idx << params->log_blocksize' may overflow an unsigned long, so
> 'hblock_idx' needs to be cast to u64 before doing the shift.

Fixed.

> > +		if (level == 0)
> > +			ra_bytes = min(max_ra_bytes,
> > +				       params->tree_size - hblock_pos);
> > +		else
> > +			ra_bytes = 0;
> 
> The first argument to min() has type unsigned long, and the second has type u64.
> Doesn't this generate a warning on 32-bit systems?

Probably.  I'll turn it into a min_t(u64...).

> > @@ -325,7 +333,7 @@ void fsverity_verify_bio(struct bio *bio)
> >  
> >  	bio_for_each_folio_all(fi, bio) {
> >  		if (!verify_data_blocks(fi.folio, fi.length, fi.offset,
> > -					max_ra_pages)) {
> > +					max_ra_pages << PAGE_SHIFT)) {
> >  			bio->bi_status = BLK_STS_IOERR;
> >  			break;
> 
> This function should calculate max_ra_bytes as bio->bi_iter.bi_size >> 2.  It's
> not necessary to convert to pages, then back to bytes again.

Aha, Andrey and I missed that.  Fixed.

> > +/**
> > + * fsverity_read_merkle_tree_block() - read Merkle tree block
> > + * @inode: inode to which this Merkle tree blocks belong
> 
> "this Merkle tree blocks belong" => "this Merkle tree block belongs"
> 
> > + * @params: merkle tree parameters
> > + * @pos: byte position within merkle tree
> > + * @ra_bytes: try to read ahead this many btes
> 
> "btes" => "bytes"

Both fixed.

> > +int fsverity_read_merkle_tree_block(struct inode *inode,
> > +				    const struct merkle_tree_params *params,
> > +				    u64 pos, unsigned long ra_bytes,
> > +				    struct fsverity_blockbuf *block)
> > +{
> > +	const struct fsverity_operations *vops = inode->i_sb->s_vop;
> > +	unsigned long page_idx;
> > +	struct page *page;
> > +	unsigned long index;
> > +	unsigned int offset_in_page;
> > +
> > +	if (fsverity_caches_blocks(inode)) {
> > +		block->verified = false;
> > +		return vops->read_merkle_tree_block(inode, pos, ra_bytes,
> > +				params->log_blocksize, block);
> > +	}
> > +
> > +	index = pos >> params->log_blocksize;
> 
> Should the fourth parameter to ->read_merkle_tree_block be the block index
> (which is computed above) instead of log_blocksize?  XFS only uses
> params->log_blocksize to compute the block index anyway.

I don't know.  XFS is infamous for everyone having a different opinion
and developers being unable to drive a consensus and being able to get a
patchset to completion.  So:

Andrey wrote an implementation that used the buffer cache and used block
indexes to load/store from the xattr structure.  I didn't like that
because layering violation.

willy suggested hanging an xarray off struct xfs_inode and using that to
cache merkle tree blocks.  For that design, we want the xarray indexes
in units of blocks to conserve space, which means passing in (pos,
log_blocksize) or a direct block index.

Christoph then suggested using a per-AG rhashtable to reduce per-inode
memory overhead, in which case the lookup key can be either (inumber,
pos) or (inumber, block index).  This is a better design if there aren't
really going to be that many xfs inodes with verity enabled, though we
lose per-inode sharding of the merkle blocks.

Dave thinks that verity inodes could constitute the majority of xfs
inodes some day and it should be in the inode instead.

Andrey and I do not have crystal balls, we have no idea how this
dchinner/hch disagreement will play out.  Earlier I got the sense that
you wanted to move towards expressing all the merkle tree info in units
of bytes.

In a lot of ways it would be preferable to move to block indexes
instead, since there's never going to be a meaningful access to merkle
position 1337.  But you're the maintainer, so it's up to you. :)

> > +/**
> > + * struct fsverity_blockbuf - Merkle Tree block buffer
> > + * @kaddr: virtual address of the block's data
> > + * @offset: block's offset into Merkle tree
> > + * @size: the Merkle tree block size
> > + * @context: filesystem private context
> > + * @verified: has this buffer been validated?
> > + *
> > + * Buffer containing single Merkle Tree block. These buffers are passed
> > + *  - to filesystem, when fs-verity is building merkel tree,
> > + *  - from filesystem, when fs-verity is reading merkle tree from a disk.
> > + * Filesystems sets kaddr together with size to point to a memory which contains
> > + * Merkle tree block. Same is done by fs-verity when Merkle tree is need to be
> > + * written down to disk.
> 
> This comment still incorrectly claims that fsverity_blockbuf is being used for
> writes.

Oops, I forgot to delete that.

--D

> 
> - Eric
>
Eric Biggers April 24, 2024, 10:08 p.m. UTC | #3
On Wed, Apr 24, 2024 at 02:25:23PM -0700, Darrick J. Wong wrote:
> > For checking whether the bitmap is in use, it's simpler and more efficient to
> > just directly check whether vi->hash_block_verified is NULL, as the existing
> > code does.  Only the code that allocates the bitmap actually needs to be aware
> > of the details of when the bitmap gets enabled.
> > 
> > fsverity_caches_blocks() has a similar issue, where it could just be replaced
> > with checking vops->read_merkle_tree_block != NULL directly (or equivalently
> > vops->drop_merkle_tree_block, which works well in
> > fsverity_drop_merkle_tree_block() since that's the function pointer it's
> > calling).  The WARN_ON_ONCE() should be done in fsverity_create_info(), not
> > inlined multiple times into the verification code.
> 
> Ok, I'll move the WARN_ON_ONCE there:
> 
> 	/*
> 	 * If the filesystem implementation supplies Merkle tree content on a
> 	 * per-block basis, it must implement both the read and drop functions.
> 	 * If it supplies content on a per-page basis, neither should be
> 	 * provided.
> 	 */
> 	if (vops->read_merkle_tree_block)
> 		WARN_ON_ONCE(vops->drop_merkle_tree_block == NULL);
> 	else
> 		WARN_ON_ONCE(vops->drop_merkle_tree_block != NULL);

Checking ->read_merkle_tree_page would make sense too, right?  I.e. we should
enforce that one of the following two cases holds:

    1. read_merkle_tree_page != NULL &&
       read_merkle_tree_block == NULL &&
       drop_merkle_tree_block == NULL

    2. read_merkle_tree_page == NULL &&
       read_merkle_tree_block != NULL &&
       drop_merkle_tree_block != NULL

> > > +		bytes_to_copy = min_t(u64, end_pos - pos,
> > > +				      params->block_size - offs_in_block);
> > > +
> > > +		err = fsverity_read_merkle_tree_block(inode, &vi->tree_params,
> > > +				pos - offs_in_block, ra_bytes, &block);
> > > +		if (err) {
> > >  			fsverity_err(inode,
> > > -				     "Error %d reading Merkle tree page %lu",
> > > -				     err, index);
> > > +				     "Error %d reading Merkle tree block %llu",
> > > +				     err, pos);
> > >  			break;
> > 
> > The error message should go into fsverity_read_merkle_tree_block() so that it
> > does not need to be duplicated in its two callers.  This would, additionally,
> > eliminate the need to introduce the 'err' variable in verify_data_block().
> 
> Do you want that to be a separate cleanup patch?

I think it makes sense to do it in this patch, since this patch introduces
fsverity_read_merkle_tree_block().  This patch also adds the 'err' variable back
to verify_data_block(), which may mislead people into thinking it's the actual
error code that users see (this has happened before) --- that could be avoided.

> > > +int fsverity_read_merkle_tree_block(struct inode *inode,
> > > +				    const struct merkle_tree_params *params,
> > > +				    u64 pos, unsigned long ra_bytes,
> > > +				    struct fsverity_blockbuf *block)
> > > +{
> > > +	const struct fsverity_operations *vops = inode->i_sb->s_vop;
> > > +	unsigned long page_idx;
> > > +	struct page *page;
> > > +	unsigned long index;
> > > +	unsigned int offset_in_page;
> > > +
> > > +	if (fsverity_caches_blocks(inode)) {
> > > +		block->verified = false;
> > > +		return vops->read_merkle_tree_block(inode, pos, ra_bytes,
> > > +				params->log_blocksize, block);
> > > +	}
> > > +
> > > +	index = pos >> params->log_blocksize;
> > 
> > Should the fourth parameter to ->read_merkle_tree_block be the block index
> > (which is computed above) instead of log_blocksize?  XFS only uses
> > params->log_blocksize to compute the block index anyway.
> 
> I don't know.  XFS is infamous for everyone having a different opinion
> and developers being unable to drive a consensus and being able to get a
> patchset to completion.  So:
> 
> Andrey wrote an implementation that used the buffer cache and used block
> indexes to load/store from the xattr structure.  I didn't like that
> because layering violation.
> 
> willy suggested hanging an xarray off struct xfs_inode and using that to
> cache merkle tree blocks.  For that design, we want the xarray indexes
> in units of blocks to conserve space, which means passing in (pos,
> log_blocksize) or a direct block index.
> 
> Christoph then suggested using a per-AG rhashtable to reduce per-inode
> memory overhead, in which case the lookup key can be either (inumber,
> pos) or (inumber, block index).  This is a better design if there aren't
> really going to be that many xfs inodes with verity enabled, though we
> lose per-inode sharding of the merkle blocks.
> 
> Dave thinks that verity inodes could constitute the majority of xfs
> inodes some day and it should be in the inode instead.
> 
> Andrey and I do not have crystal balls, we have no idea how this
> dchinner/hch disagreement will play out.  Earlier I got the sense that
> you wanted to move towards expressing all the merkle tree info in units
> of bytes.
> 
> In a lot of ways it would be preferable to move to block indexes
> instead, since there's never going to be a meaningful access to merkle
> position 1337.  But you're the maintainer, so it's up to you. :)

It makes sense to cache the blocks by index regardless of whether you get passed
the index directly.  So this is just a question of what the
->read_merkle_tree_block function prototype should be.

Currently the other fsverity_operations functions are just aligned with what the
filesystems actually need: ->read_merkle_tree_page takes a page index, whereas
->write_merkle_tree_block takes a pos and size.  xfs_fsverity_read_merkle()
needs pos, index, and size, and fsverity_read_merkle_tree_block() computes the
index already.  So that's why I'd probably go with pos, index, and size instead
of pos, log2_blocksize, and size.  There's no right answer though.

- Eric
Darrick J. Wong April 25, 2024, 12:27 a.m. UTC | #4
On Wed, Apr 24, 2024 at 10:08:58PM +0000, Eric Biggers wrote:
> On Wed, Apr 24, 2024 at 02:25:23PM -0700, Darrick J. Wong wrote:
> > > For checking whether the bitmap is in use, it's simpler and more efficient to
> > > just directly check whether vi->hash_block_verified is NULL, as the existing
> > > code does.  Only the code that allocates the bitmap actually needs to be aware
> > > of the details of when the bitmap gets enabled.
> > > 
> > > fsverity_caches_blocks() has a similar issue, where it could just be replaced
> > > with checking vops->read_merkle_tree_block != NULL directly (or equivalently
> > > vops->drop_merkle_tree_block, which works well in
> > > fsverity_drop_merkle_tree_block() since that's the function pointer it's
> > > calling).  The WARN_ON_ONCE() should be done in fsverity_create_info(), not
> > > inlined multiple times into the verification code.
> > 
> > Ok, I'll move the WARN_ON_ONCE there:
> > 
> > 	/*
> > 	 * If the filesystem implementation supplies Merkle tree content on a
> > 	 * per-block basis, it must implement both the read and drop functions.
> > 	 * If it supplies content on a per-page basis, neither should be
> > 	 * provided.
> > 	 */
> > 	if (vops->read_merkle_tree_block)
> > 		WARN_ON_ONCE(vops->drop_merkle_tree_block == NULL);
> > 	else
> > 		WARN_ON_ONCE(vops->drop_merkle_tree_block != NULL);
> 
> Checking ->read_merkle_tree_page would make sense too, right?  I.e. we should
> enforce that one of the following two cases holds:
> 
>     1. read_merkle_tree_page != NULL &&
>        read_merkle_tree_block == NULL &&
>        drop_merkle_tree_block == NULL
> 
>     2. read_merkle_tree_page == NULL &&
>        read_merkle_tree_block != NULL &&
>        drop_merkle_tree_block != NULL

<nod>

	if (vops->read_merkle_tree_page)
		WARN_ON_ONCE(vops->read_merkle_tree_block != NULL ||
			     vops->drop_merkle_tree_block != NULL);
	else
		WARN_ON_ONCE(vops->read_merkle_tree_block == NULL ||
			     vops->drop_merkle_tree_block == NULL);


> > > > +		bytes_to_copy = min_t(u64, end_pos - pos,
> > > > +				      params->block_size - offs_in_block);
> > > > +
> > > > +		err = fsverity_read_merkle_tree_block(inode, &vi->tree_params,
> > > > +				pos - offs_in_block, ra_bytes, &block);
> > > > +		if (err) {
> > > >  			fsverity_err(inode,
> > > > -				     "Error %d reading Merkle tree page %lu",
> > > > -				     err, index);
> > > > +				     "Error %d reading Merkle tree block %llu",
> > > > +				     err, pos);
> > > >  			break;
> > > 
> > > The error message should go into fsverity_read_merkle_tree_block() so that it
> > > does not need to be duplicated in its two callers.  This would, additionally,
> > > eliminate the need to introduce the 'err' variable in verify_data_block().
> > 
> > Do you want that to be a separate cleanup patch?
> 
> I think it makes sense to do it in this patch, since this patch introduces
> fsverity_read_merkle_tree_block().  This patch also adds the 'err' variable back
> to verify_data_block(), which may mislead people into thinking it's the actual
> error code that users see (this has happened before) --- that could be avoided.

Let's rename it "bad" in verify.c then.  I'm assuming that you /do/ want
the error code in fsverity_read_merkle_tree?

> > > > +int fsverity_read_merkle_tree_block(struct inode *inode,
> > > > +				    const struct merkle_tree_params *params,
> > > > +				    u64 pos, unsigned long ra_bytes,
> > > > +				    struct fsverity_blockbuf *block)
> > > > +{
> > > > +	const struct fsverity_operations *vops = inode->i_sb->s_vop;
> > > > +	unsigned long page_idx;
> > > > +	struct page *page;
> > > > +	unsigned long index;
> > > > +	unsigned int offset_in_page;
> > > > +
> > > > +	if (fsverity_caches_blocks(inode)) {
> > > > +		block->verified = false;
> > > > +		return vops->read_merkle_tree_block(inode, pos, ra_bytes,
> > > > +				params->log_blocksize, block);
> > > > +	}
> > > > +
> > > > +	index = pos >> params->log_blocksize;
> > > 
> > > Should the fourth parameter to ->read_merkle_tree_block be the block index
> > > (which is computed above) instead of log_blocksize?  XFS only uses
> > > params->log_blocksize to compute the block index anyway.
> > 
> > I don't know.  XFS is infamous for everyone having a different opinion
> > and developers being unable to drive a consensus and being able to get a
> > patchset to completion.  So:
> > 
> > Andrey wrote an implementation that used the buffer cache and used block
> > indexes to load/store from the xattr structure.  I didn't like that
> > because layering violation.
> > 
> > willy suggested hanging an xarray off struct xfs_inode and using that to
> > cache merkle tree blocks.  For that design, we want the xarray indexes
> > in units of blocks to conserve space, which means passing in (pos,
> > log_blocksize) or a direct block index.
> > 
> > Christoph then suggested using a per-AG rhashtable to reduce per-inode
> > memory overhead, in which case the lookup key can be either (inumber,
> > pos) or (inumber, block index).  This is a better design if there aren't
> > really going to be that many xfs inodes with verity enabled, though we
> > lose per-inode sharding of the merkle blocks.
> > 
> > Dave thinks that verity inodes could constitute the majority of xfs
> > inodes some day and it should be in the inode instead.
> > 
> > Andrey and I do not have crystal balls, we have no idea how this
> > dchinner/hch disagreement will play out.  Earlier I got the sense that
> > you wanted to move towards expressing all the merkle tree info in units
> > of bytes.
> > 
> > In a lot of ways it would be preferable to move to block indexes
> > instead, since there's never going to be a meaningful access to merkle
> > position 1337.  But you're the maintainer, so it's up to you. :)
> 
> It makes sense to cache the blocks by index regardless of whether you get passed
> the index directly.  So this is just a question of what the
> ->read_merkle_tree_block function prototype should be.
> 
> Currently the other fsverity_operations functions are just aligned with what the
> filesystems actually need: ->read_merkle_tree_page takes a page index, whereas
> ->write_merkle_tree_block takes a pos and size.  xfs_fsverity_read_merkle()
> needs pos, index, and size, and fsverity_read_merkle_tree_block() computes the
> index already.  So that's why I'd probably go with pos, index, and size instead
> of pos, log2_blocksize, and size.  There's no right answer though.

How about just index and size?  I think XFS will work just fine without
the byte position since it's using that as a key for the cache and the
ondisk xattrs anyway.  Even if we can't decide on which data structure
to use for caching. ;)

--D

> 
> - Eric
>
Eric Biggers April 25, 2024, 12:46 a.m. UTC | #5
On Wed, Apr 24, 2024 at 05:27:06PM -0700, Darrick J. Wong wrote:
> > > > > +		err = fsverity_read_merkle_tree_block(inode, &vi->tree_params,
> > > > > +				pos - offs_in_block, ra_bytes, &block);
> > > > > +		if (err) {
> > > > >  			fsverity_err(inode,
> > > > > -				     "Error %d reading Merkle tree page %lu",
> > > > > -				     err, index);
> > > > > +				     "Error %d reading Merkle tree block %llu",
> > > > > +				     err, pos);
> > > > >  			break;
> > > > 
> > > > The error message should go into fsverity_read_merkle_tree_block() so that it
> > > > does not need to be duplicated in its two callers.  This would, additionally,
> > > > eliminate the need to introduce the 'err' variable in verify_data_block().
> > > 
> > > Do you want that to be a separate cleanup patch?
> > 
> > I think it makes sense to do it in this patch, since this patch introduces
> > fsverity_read_merkle_tree_block().  This patch also adds the 'err' variable back
> > to verify_data_block(), which may mislead people into thinking it's the actual
> > error code that users see (this has happened before) --- that could be avoided.
> 
> Let's rename it "bad" in verify.c then.

Anything wrong with just 'if (fsverity_read_merkle_tree_block(...) != 0) {'?

> I'm assuming that you /do/ want the error code in fsverity_read_merkle_tree?

Yes, that's needed to log it.

> > It makes sense to cache the blocks by index regardless of whether you get passed
> > the index directly.  So this is just a question of what the
> > ->read_merkle_tree_block function prototype should be.
> > 
> > Currently the other fsverity_operations functions are just aligned with what the
> > filesystems actually need: ->read_merkle_tree_page takes a page index, whereas
> > ->write_merkle_tree_block takes a pos and size.  xfs_fsverity_read_merkle()
> > needs pos, index, and size, and fsverity_read_merkle_tree_block() computes the
> > index already.  So that's why I'd probably go with pos, index, and size instead
> > of pos, log2_blocksize, and size.  There's no right answer though.
> 
> How about just index and size?  I think XFS will work just fine without
> the byte position since it's using that as a key for the cache and the
> ondisk xattrs anyway.  Even if we can't decide on which data structure
> to use for caching. ;)

Currently XFS is using the byte position as the key for the ondisk xattrs.  That
could be changed if you want it to be the block index instead, though.

(I also notice you're using big endian, which isn't a great choice these days.
But I assume it's for consistency with the rest of XFS.)

- Eric
Darrick J. Wong April 25, 2024, 12:53 a.m. UTC | #6
On Thu, Apr 25, 2024 at 12:46:26AM +0000, Eric Biggers wrote:
> On Wed, Apr 24, 2024 at 05:27:06PM -0700, Darrick J. Wong wrote:
> > > > > > +		err = fsverity_read_merkle_tree_block(inode, &vi->tree_params,
> > > > > > +				pos - offs_in_block, ra_bytes, &block);
> > > > > > +		if (err) {
> > > > > >  			fsverity_err(inode,
> > > > > > -				     "Error %d reading Merkle tree page %lu",
> > > > > > -				     err, index);
> > > > > > +				     "Error %d reading Merkle tree block %llu",
> > > > > > +				     err, pos);
> > > > > >  			break;
> > > > > 
> > > > > The error message should go into fsverity_read_merkle_tree_block() so that it
> > > > > does not need to be duplicated in its two callers.  This would, additionally,
> > > > > eliminate the need to introduce the 'err' variable in verify_data_block().
> > > > 
> > > > Do you want that to be a separate cleanup patch?
> > > 
> > > I think it makes sense to do it in this patch, since this patch introduces
> > > fsverity_read_merkle_tree_block().  This patch also adds the 'err' variable back
> > > to verify_data_block(), which may mislead people into thinking it's the actual
> > > error code that users see (this has happened before) --- that could be avoided.
> > 
> > Let's rename it "bad" in verify.c then.
> 
> Anything wrong with just 'if (fsverity_read_merkle_tree_block(...) != 0) {'?

Nope.

> > I'm assuming that you /do/ want the error code in fsverity_read_merkle_tree?
> 
> Yes, that's needed to log it.

Got it.

> > > It makes sense to cache the blocks by index regardless of whether you get passed
> > > the index directly.  So this is just a question of what the
> > > ->read_merkle_tree_block function prototype should be.
> > > 
> > > Currently the other fsverity_operations functions are just aligned with what the
> > > filesystems actually need: ->read_merkle_tree_page takes a page index, whereas
> > > ->write_merkle_tree_block takes a pos and size.  xfs_fsverity_read_merkle()
> > > needs pos, index, and size, and fsverity_read_merkle_tree_block() computes the
> > > index already.  So that's why I'd probably go with pos, index, and size instead
> > > of pos, log2_blocksize, and size.  There's no right answer though.
> > 
> > How about just index and size?  I think XFS will work just fine without
> > the byte position since it's using that as a key for the cache and the
> > ondisk xattrs anyway.  Even if we can't decide on which data structure
> > to use for caching. ;)
> 
> Currently XFS is using the byte position as the key for the ondisk xattrs.  That
> could be changed if you want it to be the block index instead, though.

/me slaps forehead

Yeah, I forgot that the ondisk format uses the byte position, it's only
the caching layer that cares about the block index.  And it doesn't even
have to do that, the per-AG rhashtable is working fine and doesn't
impose a 16-byte expansion on every xfs_inode everywhere.

> (I also notice you're using big endian, which isn't a great choice these days.
> But I assume it's for consistency with the rest of XFS.)

Yep.

--D

> - Eric
>
diff mbox series

Patch

diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
index b3506f56e180b..c9d97c2bebd84 100644
--- a/fs/verity/fsverity_private.h
+++ b/fs/verity/fsverity_private.h
@@ -154,4 +154,41 @@  static inline void fsverity_init_signature(void)
 
 void __init fsverity_init_workqueue(void);
 
+static inline bool fsverity_caches_blocks(const struct inode *inode)
+{
+	const struct fsverity_operations *vops = inode->i_sb->s_vop;
+
+	WARN_ON_ONCE(vops->read_merkle_tree_block &&
+		     !vops->drop_merkle_tree_block);
+
+	return vops->read_merkle_tree_block != NULL;
+}
+
+static inline bool fsverity_uses_bitmap(const struct fsverity_info *vi,
+					const struct inode *inode)
+{
+	if (fsverity_caches_blocks(inode))
+		return false;
+
+	/*
+	 * If fs uses block-based Merkle tree caching, then fs-verity must use
+	 * hash_block_verified bitmap as there's no page to mark it with
+	 * PG_checked.
+	 */
+	return vi->tree_params.block_size != PAGE_SIZE;
+}
+
+int fsverity_read_merkle_tree_block(struct inode *inode,
+				    const struct merkle_tree_params *params,
+				    u64 pos, unsigned long ra_bytes,
+				    struct fsverity_blockbuf *block);
+
+/*
+ * Drop 'block' obtained with ->read_merkle_tree_block(). Calls out back to
+ * filesystem if ->drop_merkle_tree_block() is set, otherwise, drop the
+ * reference in the block->context.
+ */
+void fsverity_drop_merkle_tree_block(struct inode *inode,
+				     struct fsverity_blockbuf *block);
+
 #endif /* _FSVERITY_PRIVATE_H */
diff --git a/fs/verity/open.c b/fs/verity/open.c
index fdeb95eca3af3..9603b3a404f74 100644
--- a/fs/verity/open.c
+++ b/fs/verity/open.c
@@ -213,7 +213,12 @@  struct fsverity_info *fsverity_create_info(const struct inode *inode,
 	if (err)
 		goto fail;
 
-	if (vi->tree_params.block_size != PAGE_SIZE) {
+	/*
+	 * If fs uses block-based Merkle tree cachin, then fs-verity must use
+	 * hash_block_verified bitmap as there's no page to mark it with
+	 * PG_checked.
+	 */
+	if (fsverity_uses_bitmap(vi, inode)) {
 		/*
 		 * When the Merkle tree block size and page size differ, we use
 		 * a bitmap to keep track of which hash blocks have been
diff --git a/fs/verity/read_metadata.c b/fs/verity/read_metadata.c
index f58432772d9ea..94fffa060f829 100644
--- a/fs/verity/read_metadata.c
+++ b/fs/verity/read_metadata.c
@@ -14,65 +14,60 @@ 
 
 static int fsverity_read_merkle_tree(struct inode *inode,
 				     const struct fsverity_info *vi,
-				     void __user *buf, u64 offset, int length)
+				     void __user *buf, u64 pos, int length)
 {
-	const struct fsverity_operations *vops = inode->i_sb->s_vop;
-	u64 end_offset;
-	unsigned int offs_in_page;
-	pgoff_t index, last_index;
+	const u64 end_pos = min(pos + length, vi->tree_params.tree_size);
+	const struct merkle_tree_params *params = &vi->tree_params;
+	unsigned int offs_in_block = pos & (params->block_size - 1);
 	int retval = 0;
 	int err = 0;
 
-	end_offset = min(offset + length, vi->tree_params.tree_size);
-	if (offset >= end_offset)
+	if (pos >= end_pos)
 		return 0;
-	offs_in_page = offset_in_page(offset);
-	last_index = (end_offset - 1) >> PAGE_SHIFT;
 
 	/*
-	 * Iterate through each Merkle tree page in the requested range and copy
-	 * the requested portion to userspace.  Note that the Merkle tree block
-	 * size isn't important here, as we are returning a byte stream; i.e.,
-	 * we can just work with pages even if the tree block size != PAGE_SIZE.
+	 * Iterate through each Merkle tree block in the requested range and
+	 * copy the requested portion to userspace. Note that we are returning
+	 * a byte stream.
 	 */
-	for (index = offset >> PAGE_SHIFT; index <= last_index; index++) {
-		unsigned long num_ra_pages =
-			min_t(unsigned long, last_index - index + 1,
-			      inode->i_sb->s_bdi->io_pages);
-		unsigned int bytes_to_copy = min_t(u64, end_offset - offset,
-						   PAGE_SIZE - offs_in_page);
-		struct page *page;
-		const void *virt;
+	while (pos < end_pos) {
+		unsigned long ra_bytes;
+		unsigned int bytes_to_copy;
+		struct fsverity_blockbuf block = {
+			.size = params->block_size,
+		};
 
-		page = vops->read_merkle_tree_page(inode, index, num_ra_pages);
-		if (IS_ERR(page)) {
-			err = PTR_ERR(page);
+		ra_bytes = min_t(unsigned long, end_pos - pos + 1,
+				 inode->i_sb->s_bdi->io_pages << PAGE_SHIFT);
+		bytes_to_copy = min_t(u64, end_pos - pos,
+				      params->block_size - offs_in_block);
+
+		err = fsverity_read_merkle_tree_block(inode, &vi->tree_params,
+				pos - offs_in_block, ra_bytes, &block);
+		if (err) {
 			fsverity_err(inode,
-				     "Error %d reading Merkle tree page %lu",
-				     err, index);
+				     "Error %d reading Merkle tree block %llu",
+				     err, pos);
 			break;
 		}
 
-		virt = kmap_local_page(page);
-		if (copy_to_user(buf, virt + offs_in_page, bytes_to_copy)) {
-			kunmap_local(virt);
-			put_page(page);
+		if (copy_to_user(buf, block.kaddr + offs_in_block, bytes_to_copy)) {
+			fsverity_drop_merkle_tree_block(inode, &block);
 			err = -EFAULT;
 			break;
 		}
-		kunmap_local(virt);
-		put_page(page);
+		fsverity_drop_merkle_tree_block(inode, &block);
 
 		retval += bytes_to_copy;
 		buf += bytes_to_copy;
-		offset += bytes_to_copy;
+		pos += bytes_to_copy;
 
 		if (fatal_signal_pending(current))  {
 			err = -EINTR;
 			break;
 		}
 		cond_resched();
-		offs_in_page = 0;
+		offs_in_block = 0;
 	}
 	return retval ? retval : err;
 }
diff --git a/fs/verity/verify.c b/fs/verity/verify.c
index 4fcad0825a120..0b5e11073e883 100644
--- a/fs/verity/verify.c
+++ b/fs/verity/verify.c
@@ -13,14 +13,27 @@ 
 static struct workqueue_struct *fsverity_read_workqueue;
 
 /*
- * Returns true if the hash block with index @hblock_idx in the tree, located in
- * @hpage, has already been verified.
+ * Returns true if the hash block with index @hblock_idx in the tree has
+ * already been verified.
  */
-static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
+static bool is_hash_block_verified(struct inode *inode,
+				   struct fsverity_blockbuf *block,
 				   unsigned long hblock_idx)
 {
 	unsigned int blocks_per_page;
 	unsigned int i;
+	struct fsverity_info *vi = inode->i_verity_info;
+	struct page *hpage;
+
+	/*
+	 * If the filesystem uses block-based caching, then rely on the
+	 * implementation to retain verified status.
+	 */
+	if (fsverity_caches_blocks(inode))
+		return block->verified;
+
+	/* Otherwise, the filesystem uses page-based caching. */
+	hpage = (struct page *)block->context;
 
 	/*
 	 * When the Merkle tree block size and page size are the same, then the
@@ -31,7 +44,7 @@  static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
 	 * get evicted and re-instantiated from the backing storage, as new
 	 * pages always start out with PG_checked cleared.
 	 */
-	if (!vi->hash_block_verified)
+	if (!fsverity_uses_bitmap(vi, inode))
 		return PageChecked(hpage);
 
 	/*
@@ -90,20 +103,20 @@  static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
  */
 static bool
 verify_data_block(struct inode *inode, struct fsverity_info *vi,
-		  const void *data, u64 data_pos, unsigned long max_ra_pages)
+		  const void *data, u64 data_pos, unsigned long max_ra_bytes)
 {
 	const struct merkle_tree_params *params = &vi->tree_params;
 	const unsigned int hsize = params->digest_size;
 	int level;
+	int err = 0;
+	unsigned long ra_bytes;
 	u8 _want_hash[FS_VERITY_MAX_DIGEST_SIZE];
 	const u8 *want_hash;
 	u8 real_hash[FS_VERITY_MAX_DIGEST_SIZE];
 	/* The hash blocks that are traversed, indexed by level */
 	struct {
-		/* Page containing the hash block */
-		struct page *page;
-		/* Mapped address of the hash block (will be within @page) */
-		const void *addr;
+		/* Buffer containing the hash block */
+		struct fsverity_blockbuf block;
 		/* Index of the hash block in the tree overall */
 		unsigned long index;
 		/* Byte offset of the wanted hash relative to @addr */
@@ -143,11 +156,9 @@  verify_data_block(struct inode *inode, struct fsverity_info *vi,
 	for (level = 0; level < params->num_levels; level++) {
 		unsigned long next_hidx;
 		unsigned long hblock_idx;
-		pgoff_t hpage_idx;
-		unsigned int hblock_offset_in_page;
+		u64 hblock_pos;
 		unsigned int hoffset;
-		struct page *hpage;
-		const void *haddr;
+		struct fsverity_blockbuf *block = &hblocks[level].block;
 
 		/*
 		 * The index of the block in the current level; also the index
@@ -158,36 +169,34 @@  verify_data_block(struct inode *inode, struct fsverity_info *vi,
 		/* Index of the hash block in the tree overall */
 		hblock_idx = params->level_start[level] + next_hidx;
 
-		/* Index of the hash page in the tree overall */
-		hpage_idx = hblock_idx >> params->log_blocks_per_page;
-
-		/* Byte offset of the hash block within the page */
-		hblock_offset_in_page =
-			(hblock_idx << params->log_blocksize) & ~PAGE_MASK;
+		/* Byte offset of the hash block in the tree overall */
+		hblock_pos = hblock_idx << params->log_blocksize;
 
 		/* Byte offset of the hash within the block */
 		hoffset = (hidx << params->log_digestsize) &
 			  (params->block_size - 1);
 
-		hpage = inode->i_sb->s_vop->read_merkle_tree_page(inode,
-				hpage_idx, level == 0 ? min(max_ra_pages,
-					params->tree_pages - hpage_idx) : 0);
-		if (IS_ERR(hpage)) {
+		if (level == 0)
+			ra_bytes = min(max_ra_bytes,
+				       params->tree_size - hblock_pos);
+		else
+			ra_bytes = 0;
+
+		err = fsverity_read_merkle_tree_block(inode, params, hblock_pos,
+				ra_bytes, block);
+		if (err) {
 			fsverity_err(inode,
-				     "Error %ld reading Merkle tree page %lu",
-				     PTR_ERR(hpage), hpage_idx);
+				     "Error %d reading Merkle tree block %llu",
+				     err, hblock_pos);
 			goto error;
 		}
-		haddr = kmap_local_page(hpage) + hblock_offset_in_page;
-		if (is_hash_block_verified(vi, hpage, hblock_idx)) {
-			memcpy(_want_hash, haddr + hoffset, hsize);
+
+		if (is_hash_block_verified(inode, block, hblock_idx)) {
+			memcpy(_want_hash, block->kaddr + hoffset, hsize);
 			want_hash = _want_hash;
-			kunmap_local(haddr);
-			put_page(hpage);
+			fsverity_drop_merkle_tree_block(inode, block);
 			goto descend;
 		}
-		hblocks[level].page = hpage;
-		hblocks[level].addr = haddr;
 		hblocks[level].index = hblock_idx;
 		hblocks[level].hoffset = hoffset;
 		hidx = next_hidx;
@@ -197,8 +206,8 @@  verify_data_block(struct inode *inode, struct fsverity_info *vi,
 descend:
 	/* Descend the tree verifying hash blocks. */
 	for (; level > 0; level--) {
-		struct page *hpage = hblocks[level - 1].page;
-		const void *haddr = hblocks[level - 1].addr;
+		struct fsverity_blockbuf *block = &hblocks[level - 1].block;
+		const void *haddr = block->kaddr;
 		unsigned long hblock_idx = hblocks[level - 1].index;
 		unsigned int hoffset = hblocks[level - 1].hoffset;
 
@@ -211,14 +220,15 @@  verify_data_block(struct inode *inode, struct fsverity_info *vi,
 		 * idempotent, as the same hash block might be verified by
 		 * multiple threads concurrently.
 		 */
-		if (vi->hash_block_verified)
+		if (fsverity_caches_blocks(inode))
+			block->verified = true;
+		else if (fsverity_uses_bitmap(vi, inode))
 			set_bit(hblock_idx, vi->hash_block_verified);
 		else
-			SetPageChecked(hpage);
+			SetPageChecked((struct page *)block->context);
 		memcpy(_want_hash, haddr + hoffset, hsize);
 		want_hash = _want_hash;
-		kunmap_local(haddr);
-		put_page(hpage);
+		fsverity_drop_merkle_tree_block(inode, block);
 	}
 
 	/* Finally, verify the data block. */
@@ -235,16 +245,14 @@  verify_data_block(struct inode *inode, struct fsverity_info *vi,
 		     params->hash_alg->name, hsize, want_hash,
 		     params->hash_alg->name, hsize, real_hash);
 error:
-	for (; level > 0; level--) {
-		kunmap_local(hblocks[level - 1].addr);
-		put_page(hblocks[level - 1].page);
-	}
+	for (; level > 0; level--)
+		fsverity_drop_merkle_tree_block(inode, &hblocks[level - 1].block);
 	return false;
 }
 
 static bool
 verify_data_blocks(struct folio *data_folio, size_t len, size_t offset,
-		   unsigned long max_ra_pages)
+		   unsigned long max_ra_bytes)
 {
 	struct inode *inode = data_folio->mapping->host;
 	struct fsverity_info *vi = inode->i_verity_info;
@@ -262,7 +270,7 @@  verify_data_blocks(struct folio *data_folio, size_t len, size_t offset,
 
 		data = kmap_local_folio(data_folio, offset);
 		valid = verify_data_block(inode, vi, data, pos + offset,
-					  max_ra_pages);
+					  max_ra_bytes);
 		kunmap_local(data);
 		if (!valid)
 			return false;
@@ -325,7 +333,7 @@  void fsverity_verify_bio(struct bio *bio)
 
 	bio_for_each_folio_all(fi, bio) {
 		if (!verify_data_blocks(fi.folio, fi.length, fi.offset,
-					max_ra_pages)) {
+					max_ra_pages << PAGE_SHIFT)) {
 			bio->bi_status = BLK_STS_IOERR;
 			break;
 		}
@@ -362,3 +370,57 @@  void __init fsverity_init_workqueue(void)
 	if (!fsverity_read_workqueue)
 		panic("failed to allocate fsverity_read_queue");
 }
+
+/**
+ * fsverity_read_merkle_tree_block() - read Merkle tree block
+ * @inode: inode to which this Merkle tree blocks belong
+ * @params: merkle tree parameters
+ * @pos: byte position within merkle tree
+ * @ra_bytes: try to read ahead this many btes
+ * @block: block to be loaded
+ *
+ * This function loads data from a merkle tree.
+ */
+int fsverity_read_merkle_tree_block(struct inode *inode,
+				    const struct merkle_tree_params *params,
+				    u64 pos, unsigned long ra_bytes,
+				    struct fsverity_blockbuf *block)
+{
+	const struct fsverity_operations *vops = inode->i_sb->s_vop;
+	unsigned long page_idx;
+	struct page *page;
+	unsigned long index;
+	unsigned int offset_in_page;
+
+	if (fsverity_caches_blocks(inode)) {
+		block->verified = false;
+		return vops->read_merkle_tree_block(inode, pos, ra_bytes,
+				params->log_blocksize, block);
+	}
+
+	index = pos >> params->log_blocksize;
+	page_idx = round_down(index, params->blocks_per_page);
+	offset_in_page = pos & ~PAGE_MASK;
+
+	page = vops->read_merkle_tree_page(inode, page_idx,
+			ra_bytes >> PAGE_SHIFT);
+	if (IS_ERR(page))
+		return PTR_ERR(page);
+
+	block->kaddr = kmap_local_page(page) + offset_in_page;
+	block->context = page;
+	return 0;
+}
+
+void fsverity_drop_merkle_tree_block(struct inode *inode,
+				     struct fsverity_blockbuf *block)
+{
+	if (fsverity_caches_blocks(inode))  {
+		inode->i_sb->s_vop->drop_merkle_tree_block(block);
+	} else {
+		kunmap_local(block->kaddr);
+		put_page((struct page *)block->context);
+	}
+	block->kaddr = NULL;
+	block->context = NULL;
+}
diff --git a/include/linux/fsverity.h b/include/linux/fsverity.h
index ac58b19f23d32..c3478efd67d62 100644
--- a/include/linux/fsverity.h
+++ b/include/linux/fsverity.h
@@ -26,6 +26,35 @@ 
 /* Arbitrary limit to bound the kmalloc() size.  Can be changed. */
 #define FS_VERITY_MAX_DESCRIPTOR_SIZE	16384
 
+/**
+ * struct fsverity_blockbuf - Merkle Tree block buffer
+ * @kaddr: virtual address of the block's data
+ * @offset: block's offset into Merkle tree
+ * @size: the Merkle tree block size
+ * @context: filesystem private context
+ * @verified: has this buffer been validated?
+ *
+ * Buffer containing single Merkle Tree block. These buffers are passed
+ *  - to filesystem, when fs-verity is building merkel tree,
+ *  - from filesystem, when fs-verity is reading merkle tree from a disk.
+ * Filesystems sets kaddr together with size to point to a memory which contains
+ * Merkle tree block. Same is done by fs-verity when Merkle tree is need to be
+ * written down to disk.
+ *
+ * While reading the tree, fs-verity calls ->read_merkle_tree_block followed by
+ * ->drop_merkle_tree_block to let filesystem know that memory can be freed.
+ *
+ * The context is optional. This field can be used by filesystem to passthrough
+ * state from ->read_merkle_tree_block to ->drop_merkle_tree_block.
+ */
+struct fsverity_blockbuf {
+	void *kaddr;
+	u64 offset;
+	unsigned int verified:1;
+	unsigned int size;
+	void *context;
+};
+
 /* Verity operations for filesystems */
 struct fsverity_operations {
 
@@ -101,12 +130,43 @@  struct fsverity_operations {
 	 *
 	 * Note that this must retrieve a *page*, not necessarily a *block*.
 	 *
+	 * If this function is implemented, do not implement
+	 * ->read_merkle_tree_block or ->drop_merkle_tree_block.
+	 *
 	 * Return: the page on success, ERR_PTR() on failure
 	 */
 	struct page *(*read_merkle_tree_page)(struct inode *inode,
 					      pgoff_t index,
 					      unsigned long num_ra_pages);
 
+	/**
+	 * Read a Merkle tree block of the given inode.
+	 * @inode: the inode
+	 * @pos: byte offset of the block within the Merkle tree
+	 * @ra_bytes: The number of bytes that should be
+	 *		prefetched starting at @pos if the page at @pos
+	 *		isn't already cached.  Implementations may ignore this
+	 *		argument; it's only a performance optimization.
+	 * @log_blocksize: log2 of the size of the expected block
+	 * @block: block buffer for filesystem to point it to the block
+	 *
+	 * This can be called at any time on an open verity file.  It may be
+	 * called by multiple processes concurrently.
+	 *
+	 * Implementations may cache the @block->verified state in
+	 * ->drop_merkle_tree_block.  They must clear the @block->verified
+	 * flag for a cache miss.
+	 *
+	 * If this function is implemented, ->drop_merkle_tree_block must also
+	 * be implemented.
+	 *
+	 * Return: 0 on success, -errno on failure
+	 */
+	int (*read_merkle_tree_block)(struct inode *inode,
+				      u64 pos, unsigned long ra_bytes,
+				      unsigned int log_blocksize,
+				      struct fsverity_blockbuf *block);
+
 	/**
 	 * Write a Merkle tree block to the given inode.
 	 *
@@ -122,6 +182,22 @@  struct fsverity_operations {
 	 */
 	int (*write_merkle_tree_block)(struct inode *inode, const void *buf,
 				       u64 pos, unsigned int size);
+
+	/**
+	 * Release the reference to a Merkle tree block
+	 *
+	 * @block: the block to release
+	 *
+	 * This is called when fs-verity is done with a block obtained with
+	 * ->read_merkle_tree_block().
+	 *
+	 * Implementations should cache a @block->verified==1 state to avoid
+	 * unnecessary revalidations during later accesses.
+	 *
+	 * If this function is implemented, ->read_merkle_tree_block must also
+	 * be implemented.
+	 */
+	void (*drop_merkle_tree_block)(struct fsverity_blockbuf *block);
 };
 
 #ifdef CONFIG_FS_VERITY