[03/18] fsverity: convert verification to use byte instead of page offsets

Message ID 171444679642.955480.14668034329027994356.stgit@frogsfrogsfrogs (mailing list archive)
State New, archived
Series [01/18] fs: add FS_XFLAG_VERITY for verity files

Commit Message

Darrick J. Wong April 30, 2024, 3:20 a.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

Convert all the hash verification code to use byte offsets instead of
page offsets so that fsverity can support implementations that supply
merkle tree information in units of merkle tree blocks instead of pages.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/verity/fsverity_private.h |    8 ++
 fs/verity/read_metadata.c    |   65 ++++++++-----------
 fs/verity/verify.c           |  145 ++++++++++++++++++++++++++++--------------
 include/linux/fsverity.h     |   19 ++++++
 4 files changed, 152 insertions(+), 85 deletions(-)

Comments

Christoph Hellwig May 1, 2024, 7:33 a.m. UTC | #1
> +	const u64 end_pos = min(pos + length, vi->tree_params.tree_size);
> +	struct backing_dev_info *bdi = inode->i_sb->s_bdi;
> +	const u64 max_ra_bytes = min((u64)bdi->io_pages << PAGE_SHIFT,
> +				     ULONG_MAX);
> +	const struct merkle_tree_params *params = &vi->tree_params;

bdi->io_pages is really a VM readahead concept.  I know this is existing
code, but can we rethink why this is even used here?

> +	unsigned int offs_in_block = pos & (params->block_size - 1);
>  	int retval = 0;
>  	int err = 0;
>  
> +	 * Iterate through each Merkle tree block in the requested range and
> +	 * copy the requested portion to userspace. Note that we are returning
> +	 * a byte stream.
>  	 */
> +	while (pos < end_pos) {
> +		unsigned long ra_bytes;
> +		unsigned int bytes_to_copy;
> +		struct fsverity_blockbuf block = { };
>  
> +		ra_bytes = min_t(unsigned long, end_pos - pos, max_ra_bytes);
> +		bytes_to_copy = min_t(u64, end_pos - pos,
> +				      params->block_size - offs_in_block);
> +
> +		err = fsverity_read_merkle_tree_block(inode, &vi->tree_params,
> +						      pos - offs_in_block,
> +						      ra_bytes, &block);

Maybe it's just me, but isn't passing a byte offset to a read...block
routine a bit weird and this should operate on the block number instead?

> +		if (copy_to_user(buf, block.kaddr + offs_in_block, bytes_to_copy)) {

And the returned/passed value should be a kernel pointer to the start
of the in-memory copy of the block?

> +static bool is_hash_block_verified(struct inode *inode,
> +				   struct fsverity_blockbuf *block,
>  				   unsigned long hblock_idx)

Other fsverity code seems to use the (IMHO) much more readable
two-tab indentation for prototype continuations, maybe stick to that?

>
>  {
> +	struct fsverity_info *vi = inode->i_verity_info;
> +	struct page *hpage = (struct page *)block->context;

block->context is a void pointer, no need for casting it.

> +	for (; level > 0; level--)
> +		fsverity_drop_merkle_tree_block(inode, &hblocks[level - 1].block);

Overly long line here.  But the loop kinda looks odd anyway with the
extra one-off in the body instead of the loop.
Darrick J. Wong May 1, 2024, 10:33 p.m. UTC | #2
On Wed, May 01, 2024 at 12:33:14AM -0700, Christoph Hellwig wrote:
> > +	const u64 end_pos = min(pos + length, vi->tree_params.tree_size);
> > +	struct backing_dev_info *bdi = inode->i_sb->s_bdi;
> > +	const u64 max_ra_bytes = min((u64)bdi->io_pages << PAGE_SHIFT,
> > +				     ULONG_MAX);
> > +	const struct merkle_tree_params *params = &vi->tree_params;
> 
> bdi->io_pages is really a VM readahead concept.  I know this is existing
> code, but can we rethink why this is even used here?

I would get rid of it entirely for the merkle-by-block case, since we'd
have to walk the xattr tree again just to find the next block.  XFS
ignores the readahead value entirely.

I think this only makes sense for the merkle-by-page case, and only
because ext4 and friends are stuffing the merkle data in the posteof
parts of the file mapping.

And even then, shouldn't we figure out the amount of readahead going on
and only ask for enough readahead of the merkle tree to satisfy that
readahead?

> > +	unsigned int offs_in_block = pos & (params->block_size - 1);
> >  	int retval = 0;
> >  	int err = 0;
> >  
> > +	 * Iterate through each Merkle tree block in the requested range and
> > +	 * copy the requested portion to userspace. Note that we are returning
> > +	 * a byte stream.
> >  	 */
> > +	while (pos < end_pos) {
> > +		unsigned long ra_bytes;
> > +		unsigned int bytes_to_copy;
> > +		struct fsverity_blockbuf block = { };
> >  
> > +		ra_bytes = min_t(unsigned long, end_pos - pos, max_ra_bytes);
> > +		bytes_to_copy = min_t(u64, end_pos - pos,
> > +				      params->block_size - offs_in_block);
> > +
> > +		err = fsverity_read_merkle_tree_block(inode, &vi->tree_params,
> > +						      pos - offs_in_block,
> > +						      ra_bytes, &block);
> 
> Maybe it's just me, but isn't passing a byte offset to a read...block
> routine a bit weird and this should operate on the block number instead?

I would think so, but here's the thing -- the write_merkle_tree_block
functions get passed pos and length in units of bytes.  Maybe fsverity
should be passing (blockno, blocksize) to the read and write
functions?  Eric said he could be persuaded to change it:

https://lore.kernel.org/linux-xfs/20240307224903.GE1799@sol.localdomain/
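
For illustration, a block-based interface might look something like
this (a hypothetical sketch for discussion, not what this patch
implements; the parameter names are made up):

	/*
	 * Read the Merkle tree block with index @block_idx (a block
	 * index, not a byte offset), reading ahead up to @ra_blocks
	 * further blocks.  The byte position would be derived
	 * internally as (u64)block_idx << params->log_blocksize.
	 */
	int fsverity_read_merkle_tree_block(struct inode *inode,
			const struct merkle_tree_params *params,
			unsigned long block_idx, unsigned long ra_blocks,
			struct fsverity_blockbuf *block);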

> > +		if (copy_to_user(buf, block.kaddr + offs_in_block, bytes_to_copy)) {
> 
> And the returned/passed value should be a kernel pointer to the start
> of the in-memory copy of the block?

<shrug> This particular callsite is reading merkle data on behalf of an
ioctl that exports data.  Maybe we want the filesystem's errors to be
bounced up to userspace?

> > +static bool is_hash_block_verified(struct inode *inode,
> > +				   struct fsverity_blockbuf *block,
> >  				   unsigned long hblock_idx)
> 
> Other fsverity code seems to use the (IMHO) much more readable
> two-tab indentation for prototype continuations, maybe stick to that?

I'll do that, if Eric says so. :)

> >
> >  {
> > +	struct fsverity_info *vi = inode->i_verity_info;
> > +	struct page *hpage = (struct page *)block->context;
> 
> block->context is a void pointer, no need for casting it.

Eric insisted on it:
https://lore.kernel.org/linux-xfs/20240306035622.GA68962@sol.localdomain/

> > +	for (; level > 0; level--)
> > +		fsverity_drop_merkle_tree_block(inode, &hblocks[level - 1].block);
> 
> Overly long line here.  But the loop kinda looks odd anyway with the
> extra one-off in the body instead of the loop.

I /think/ that's a side effect of reusing the value of @level after the
first loop fails as the initial conditions of the unwind loop.  AFAICT
it doesn't leak, but it's not entirely straightforward.

--D
Eric Biggers May 2, 2024, 12:42 a.m. UTC | #3
On Wed, May 01, 2024 at 03:33:03PM -0700, Darrick J. Wong wrote:
> On Wed, May 01, 2024 at 12:33:14AM -0700, Christoph Hellwig wrote:
> > > +	const u64 end_pos = min(pos + length, vi->tree_params.tree_size);
> > > +	struct backing_dev_info *bdi = inode->i_sb->s_bdi;
> > > +	const u64 max_ra_bytes = min((u64)bdi->io_pages << PAGE_SHIFT,
> > > +				     ULONG_MAX);
> > > +	const struct merkle_tree_params *params = &vi->tree_params;
> > 
> > bdi->io_pages is really a VM readahead concept.  I know this is existing
> > code, but can we rethink why this is even used here?
> 
> I would get rid of it entirely for the merkle-by-block case, since we'd
> have to walk the xattr tree again just to find the next block.  XFS
> ignores the readahead value entirely.
> 
> I think this only makes sense for the merkle-by-page case, and only
> because ext4 and friends are stuffing the merkle data in the posteof
> parts of the file mapping.
> 
> And even then, shouldn't we figure out the amount of readahead going on
> and only ask for enough readahead of the merkle tree to satisfy that
> readahead?

The existing code is:

                unsigned long num_ra_pages =
                        min_t(unsigned long, last_index - index + 1,
                              inode->i_sb->s_bdi->io_pages);

So it does limit the readahead amount to the amount remaining to be read.

In addition, it's limited to io_pages.  It's possible that's not the best value
to use (maybe it should be ra_pages?), but the intent was to just use a large
readahead size, since this code is doing a fully sequential read.

I do think that the concept of Merkle tree readahead makes sense regardless of
how the blocks are being stored.  Having to go to disk every time a new 4K
Merkle tree block is needed increases read latencies.  It doesn't need to be
included in the initial implementation though.

> > And the returned/passed value should be a kernel pointer to the start
> > of the in-memory copy of the block?
> 
> <shrug> This particular callsite is reading merkle data on behalf of an
> ioctl that exports data.  Maybe we want the filesystem's errors to be
> bounced up to userspace?

Yes, I think so.

> > > +static bool is_hash_block_verified(struct inode *inode,
> > > +				   struct fsverity_blockbuf *block,
> > >  				   unsigned long hblock_idx)
> > 
> > Other fsverity code seems to use the (IMHO) much more readable
> > two-tab indentation for prototype continuations, maybe stick to that?
> 
> I'll do that, if Eric says so. :)

My preference is to align continuations with the line that they're continuing:

static bool is_hash_block_verified(struct inode *inode,
				   struct fsverity_blockbuf *block,
				   unsigned long hblock_idx)

> > >
> > >  {
> > > +	struct fsverity_info *vi = inode->i_verity_info;
> > > +	struct page *hpage = (struct page *)block->context;
> > 
> > block->context is a void pointer, no need for casting it.
> 
> Eric insisted on it:
> https://lore.kernel.org/linux-xfs/20240306035622.GA68962@sol.localdomain/

No, I didn't.  It showed up in some code snippets that I suggested, but the
casts originated from the patch itself.  Leaving out the cast is fine with me.

> 
> > > +	for (; level > 0; level--)
> > > +		fsverity_drop_merkle_tree_block(inode, &hblocks[level - 1].block);
> > 
> > Overly long line here.  But the loop kinda looks odd anyway with the
> > extra one-off in the body instead of the loop.
> 
> I /think/ that's a side effect of reusing the value of @level after the
> first loop fails as the initial conditions of the unwind loop.  AFAICT
> it doesn't leak, but it's not entirely straightforward.

When an error occurs either ascending or descending the tree, we end up here
with 'level' containing the number of levels that need to be cleaned up.  It
might be clearer if it was called 'num_levels', though that could be confused
with 'params->num_levels'.  Or we could use: 'while (level-- > 0)'.

This is unrelated to this patch though.
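
As a sketch (using the cleanup helper this patch adds), the two
spellings of the unwind would be:

	/* current form: 'level' counts the levels left to drop */
	for (; level > 0; level--)
		fsverity_drop_merkle_tree_block(inode,
						&hblocks[level - 1].block);

	/* alternative: fold the decrement into the condition */
	while (level-- > 0)
		fsverity_drop_merkle_tree_block(inode, &hblocks[level].block);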

- Eric
Darrick J. Wong May 8, 2024, 8:14 p.m. UTC | #4
On Thu, May 02, 2024 at 12:42:31AM +0000, Eric Biggers wrote:
> On Wed, May 01, 2024 at 03:33:03PM -0700, Darrick J. Wong wrote:
> > On Wed, May 01, 2024 at 12:33:14AM -0700, Christoph Hellwig wrote:
> > > > +	const u64 end_pos = min(pos + length, vi->tree_params.tree_size);
> > > > +	struct backing_dev_info *bdi = inode->i_sb->s_bdi;
> > > > +	const u64 max_ra_bytes = min((u64)bdi->io_pages << PAGE_SHIFT,
> > > > +				     ULONG_MAX);
> > > > +	const struct merkle_tree_params *params = &vi->tree_params;
> > > 
> > > bdi->io_pages is really a VM readahead concept.  I know this is existing
> > > code, but can we rethink why this is even used here?
> > 
> > I would get rid of it entirely for the merkle-by-block case, since we'd
> > have to walk the xattr tree again just to find the next block.  XFS
> > ignores the readahead value entirely.
> > 
> > I think this only makes sense for the merkle-by-page case, and only
> > because ext4 and friends are stuffing the merkle data in the posteof
> > parts of the file mapping.
> > 
> > And even then, shouldn't we figure out the amount of readahead going on
> > and only ask for enough readahead of the merkle tree to satisfy that
> > readahead?
> 
> The existing code is:
> 
>                 unsigned long num_ra_pages =
>                         min_t(unsigned long, last_index - index + 1,
>                               inode->i_sb->s_bdi->io_pages);
> 
> So it does limit the readahead amount to the amount remaining to be read.
> 
> In addition, it's limited to io_pages.  It's possible that's not the best value
> to use (maybe it should be ra_pages?), but the intent was to just use a large
> readahead size, since this code is doing a fully sequential read.

io_pages is supposed to be the optimal IO size, whereas ra_pages is the
readahead size for the block device.  I don't know why you chose
io_pages, but I'm assuming there's a reason there. :)

Somewhat confusingly, I think mm/readahead.c picks the maximum of
io_pages and ra_pages, which doesn't clear things up for me either.
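
(If memory serves, force_page_cache_ra() in mm/readahead.c clamps
roughly like this, so io_pages only ever raises the cap above
ra_pages, never lowers it:

	max_pages = max_t(unsigned long, bdi->io_pages, ra->ra_pages);
	nr_to_read = min_t(unsigned long, nr_to_read, max_pages);

though don't quote me on the exact lines.)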

Personally I think fsverity should be using ra_pages here, but changing
it should be a different patch with a separate justification.  This
patch simply has to translate the merkle-by-page code to handle by-block.

> I do think that the concept of Merkle tree readahead makes sense regardless of
> how the blocks are being stored.  Having to go to disk every time a new 4K
> Merkle tree block is needed increases read latencies.  It doesn't need to be
> included in the initial implementation though.

Of course, if we're really ok with xfs making a giant left turn and
storing the entire merkle tree as one big chunk of file range in the
attr fork, then suddenly it *does* make sense to allow merkle tree
readahead again.

> > > And the returned/passed value should be a kernel pointer to the start
> > > of the in-memory copy of the block?
> > 
> > <shrug> This particular callsite is reading merkle data on behalf of an
> > ioctl that exports data.  Maybe we want the filesystem's errors to be
> > bounced up to userspace?
> 
> Yes, I think so.

Ok, thanks for confirming that.

> > > > +static bool is_hash_block_verified(struct inode *inode,
> > > > +				   struct fsverity_blockbuf *block,
> > > >  				   unsigned long hblock_idx)
> > > 
> > > Other fsverity code seems to use the (IMHO) much more readable
> > > two-tab indentation for prototype continuations, maybe stick to that?
> > 
> > I'll do that, if Eric says so. :)
> 
> My preference is to align continuations with the line that they're continuing:
> 
> static bool is_hash_block_verified(struct inode *inode,
> 				   struct fsverity_blockbuf *block,
> 				   unsigned long hblock_idx)
> 
> > > >
> > > >  {
> > > > +	struct fsverity_info *vi = inode->i_verity_info;
> > > > +	struct page *hpage = (struct page *)block->context;
> > > 
> > > block->context is a void pointer, no need for casting it.
> > 
> > Eric insisted on it:
> > https://lore.kernel.org/linux-xfs/20240306035622.GA68962@sol.localdomain/
> 
> No, I didn't.  It showed up in some code snippets that I suggested, but the
> casts originated from the patch itself.  Leaving out the cast is fine with me.

Oh ok.  I'll drop those then.

> > 
> > > > +	for (; level > 0; level--)
> > > > +		fsverity_drop_merkle_tree_block(inode, &hblocks[level - 1].block);
> > > 
> > > Overly long line here.  But the loop kinda looks odd anyway with the
> > > extra one-off in the body instead of the loop.
> > 
> > I /think/ that's a side effect of reusing the value of @level after the
> > first loop fails as the initial conditions of the unwind loop.  AFAICT
> > it doesn't leak, but it's not entirely straightforward.
> 
> When an error occurs either ascending or descending the tree, we end up here
> with 'level' containing the number of levels that need to be cleaned up.  It
> might be clearer if it was called 'num_levels', though that could be confused
> with 'params->num_levels'.  Or we could use: 'while (level-- > 0)'.
> 
> This is unrelated to this patch though.

<nod>

--D

> - Eric
>

Patch

diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
index b3506f56e180b..8a41e27413284 100644
--- a/fs/verity/fsverity_private.h
+++ b/fs/verity/fsverity_private.h
@@ -154,4 +154,12 @@  static inline void fsverity_init_signature(void)
 
 void __init fsverity_init_workqueue(void);
 
+int fsverity_read_merkle_tree_block(struct inode *inode,
+				    const struct merkle_tree_params *params,
+				    u64 pos, unsigned long ra_bytes,
+				    struct fsverity_blockbuf *block);
+
+void fsverity_drop_merkle_tree_block(struct inode *inode,
+				     struct fsverity_blockbuf *block);
+
 #endif /* _FSVERITY_PRIVATE_H */
diff --git a/fs/verity/read_metadata.c b/fs/verity/read_metadata.c
index f58432772d9ea..4011a02f5d32d 100644
--- a/fs/verity/read_metadata.c
+++ b/fs/verity/read_metadata.c
@@ -14,65 +14,54 @@ 
 
 static int fsverity_read_merkle_tree(struct inode *inode,
 				     const struct fsverity_info *vi,
-				     void __user *buf, u64 offset, int length)
+				     void __user *buf, u64 pos, int length)
 {
-	const struct fsverity_operations *vops = inode->i_sb->s_vop;
-	u64 end_offset;
-	unsigned int offs_in_page;
-	pgoff_t index, last_index;
+	const u64 end_pos = min(pos + length, vi->tree_params.tree_size);
+	struct backing_dev_info *bdi = inode->i_sb->s_bdi;
+	const u64 max_ra_bytes = min((u64)bdi->io_pages << PAGE_SHIFT,
+				     ULONG_MAX);
+	const struct merkle_tree_params *params = &vi->tree_params;
+	unsigned int offs_in_block = pos & (params->block_size - 1);
 	int retval = 0;
 	int err = 0;
 
-	end_offset = min(offset + length, vi->tree_params.tree_size);
-	if (offset >= end_offset)
-		return 0;
-	offs_in_page = offset_in_page(offset);
-	last_index = (end_offset - 1) >> PAGE_SHIFT;
-
 	/*
-	 * Iterate through each Merkle tree page in the requested range and copy
-	 * the requested portion to userspace.  Note that the Merkle tree block
-	 * size isn't important here, as we are returning a byte stream; i.e.,
-	 * we can just work with pages even if the tree block size != PAGE_SIZE.
+	 * Iterate through each Merkle tree block in the requested range and
+	 * copy the requested portion to userspace. Note that we are returning
+	 * a byte stream.
 	 */
-	for (index = offset >> PAGE_SHIFT; index <= last_index; index++) {
-		unsigned long num_ra_pages =
-			min_t(unsigned long, last_index - index + 1,
-			      inode->i_sb->s_bdi->io_pages);
-		unsigned int bytes_to_copy = min_t(u64, end_offset - offset,
-						   PAGE_SIZE - offs_in_page);
-		struct page *page;
-		const void *virt;
+	while (pos < end_pos) {
+		unsigned long ra_bytes;
+		unsigned int bytes_to_copy;
+		struct fsverity_blockbuf block = { };
 
-		page = vops->read_merkle_tree_page(inode, index, num_ra_pages);
-		if (IS_ERR(page)) {
-			err = PTR_ERR(page);
-			fsverity_err(inode,
-				     "Error %d reading Merkle tree page %lu",
-				     err, index);
+		ra_bytes = min_t(unsigned long, end_pos - pos, max_ra_bytes);
+		bytes_to_copy = min_t(u64, end_pos - pos,
+				      params->block_size - offs_in_block);
+
+		err = fsverity_read_merkle_tree_block(inode, &vi->tree_params,
+						      pos - offs_in_block,
+						      ra_bytes, &block);
+		if (err)
 			break;
-		}
 
-		virt = kmap_local_page(page);
-		if (copy_to_user(buf, virt + offs_in_page, bytes_to_copy)) {
-			kunmap_local(virt);
-			put_page(page);
+		if (copy_to_user(buf, block.kaddr + offs_in_block, bytes_to_copy)) {
+			fsverity_drop_merkle_tree_block(inode, &block);
 			err = -EFAULT;
 			break;
 		}
-		kunmap_local(virt);
-		put_page(page);
+		fsverity_drop_merkle_tree_block(inode, &block);
 
 		retval += bytes_to_copy;
 		buf += bytes_to_copy;
-		offset += bytes_to_copy;
+		pos += bytes_to_copy;
 
 		if (fatal_signal_pending(current))  {
 			err = -EINTR;
 			break;
 		}
 		cond_resched();
-		offs_in_page = 0;
+		offs_in_block = 0;
 	}
 	return retval ? retval : err;
 }
diff --git a/fs/verity/verify.c b/fs/verity/verify.c
index 4fcad0825a120..1c4a7c63c0a1c 100644
--- a/fs/verity/verify.c
+++ b/fs/verity/verify.c
@@ -13,12 +13,15 @@ 
 static struct workqueue_struct *fsverity_read_workqueue;
 
 /*
- * Returns true if the hash block with index @hblock_idx in the tree, located in
- * @hpage, has already been verified.
+ * Returns true if the hash @block with index @hblock_idx in the merkle tree
+ * for @inode has already been verified.
  */
-static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
+static bool is_hash_block_verified(struct inode *inode,
+				   struct fsverity_blockbuf *block,
 				   unsigned long hblock_idx)
 {
+	struct fsverity_info *vi = inode->i_verity_info;
+	struct page *hpage = (struct page *)block->context;
 	unsigned int blocks_per_page;
 	unsigned int i;
 
@@ -90,20 +93,19 @@  static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
  */
 static bool
 verify_data_block(struct inode *inode, struct fsverity_info *vi,
-		  const void *data, u64 data_pos, unsigned long max_ra_pages)
+		  const void *data, u64 data_pos, unsigned long max_ra_bytes)
 {
 	const struct merkle_tree_params *params = &vi->tree_params;
 	const unsigned int hsize = params->digest_size;
 	int level;
+	unsigned long ra_bytes;
 	u8 _want_hash[FS_VERITY_MAX_DIGEST_SIZE];
 	const u8 *want_hash;
 	u8 real_hash[FS_VERITY_MAX_DIGEST_SIZE];
 	/* The hash blocks that are traversed, indexed by level */
 	struct {
-		/* Page containing the hash block */
-		struct page *page;
-		/* Mapped address of the hash block (will be within @page) */
-		const void *addr;
+		/* Buffer containing the hash block */
+		struct fsverity_blockbuf block;
 		/* Index of the hash block in the tree overall */
 		unsigned long index;
 		/* Byte offset of the wanted hash relative to @addr */
@@ -143,11 +145,9 @@  verify_data_block(struct inode *inode, struct fsverity_info *vi,
 	for (level = 0; level < params->num_levels; level++) {
 		unsigned long next_hidx;
 		unsigned long hblock_idx;
-		pgoff_t hpage_idx;
-		unsigned int hblock_offset_in_page;
+		u64 hblock_pos;
 		unsigned int hoffset;
-		struct page *hpage;
-		const void *haddr;
+		struct fsverity_blockbuf *block = &hblocks[level].block;
 
 		/*
 		 * The index of the block in the current level; also the index
@@ -158,36 +158,29 @@  verify_data_block(struct inode *inode, struct fsverity_info *vi,
 		/* Index of the hash block in the tree overall */
 		hblock_idx = params->level_start[level] + next_hidx;
 
-		/* Index of the hash page in the tree overall */
-		hpage_idx = hblock_idx >> params->log_blocks_per_page;
-
-		/* Byte offset of the hash block within the page */
-		hblock_offset_in_page =
-			(hblock_idx << params->log_blocksize) & ~PAGE_MASK;
+		/* Byte offset of the hash block in the tree overall */
+		hblock_pos = (u64)hblock_idx << params->log_blocksize;
 
 		/* Byte offset of the hash within the block */
 		hoffset = (hidx << params->log_digestsize) &
 			  (params->block_size - 1);
 
-		hpage = inode->i_sb->s_vop->read_merkle_tree_page(inode,
-				hpage_idx, level == 0 ? min(max_ra_pages,
-					params->tree_pages - hpage_idx) : 0);
-		if (IS_ERR(hpage)) {
-			fsverity_err(inode,
-				     "Error %ld reading Merkle tree page %lu",
-				     PTR_ERR(hpage), hpage_idx);
+		if (level == 0)
+			ra_bytes = min_t(u64, max_ra_bytes,
+					 params->tree_size - hblock_pos);
+		else
+			ra_bytes = 0;
+
+		if (fsverity_read_merkle_tree_block(inode, params, hblock_pos,
+						    ra_bytes, block) != 0)
 			goto error;
-		}
-		haddr = kmap_local_page(hpage) + hblock_offset_in_page;
-		if (is_hash_block_verified(vi, hpage, hblock_idx)) {
-			memcpy(_want_hash, haddr + hoffset, hsize);
+
+		if (is_hash_block_verified(inode, block, hblock_idx)) {
+			memcpy(_want_hash, block->kaddr + hoffset, hsize);
 			want_hash = _want_hash;
-			kunmap_local(haddr);
-			put_page(hpage);
+			fsverity_drop_merkle_tree_block(inode, block);
 			goto descend;
 		}
-		hblocks[level].page = hpage;
-		hblocks[level].addr = haddr;
 		hblocks[level].index = hblock_idx;
 		hblocks[level].hoffset = hoffset;
 		hidx = next_hidx;
@@ -197,8 +190,8 @@  verify_data_block(struct inode *inode, struct fsverity_info *vi,
 descend:
 	/* Descend the tree verifying hash blocks. */
 	for (; level > 0; level--) {
-		struct page *hpage = hblocks[level - 1].page;
-		const void *haddr = hblocks[level - 1].addr;
+		struct fsverity_blockbuf *block = &hblocks[level - 1].block;
+		const void *haddr = block->kaddr;
 		unsigned long hblock_idx = hblocks[level - 1].index;
 		unsigned int hoffset = hblocks[level - 1].hoffset;
 
@@ -214,11 +207,10 @@  verify_data_block(struct inode *inode, struct fsverity_info *vi,
 		if (vi->hash_block_verified)
 			set_bit(hblock_idx, vi->hash_block_verified);
 		else
-			SetPageChecked(hpage);
+			SetPageChecked((struct page *)block->context);
 		memcpy(_want_hash, haddr + hoffset, hsize);
 		want_hash = _want_hash;
-		kunmap_local(haddr);
-		put_page(hpage);
+		fsverity_drop_merkle_tree_block(inode, block);
 	}
 
 	/* Finally, verify the data block. */
@@ -235,16 +227,14 @@  verify_data_block(struct inode *inode, struct fsverity_info *vi,
 		     params->hash_alg->name, hsize, want_hash,
 		     params->hash_alg->name, hsize, real_hash);
 error:
-	for (; level > 0; level--) {
-		kunmap_local(hblocks[level - 1].addr);
-		put_page(hblocks[level - 1].page);
-	}
+	for (; level > 0; level--)
+		fsverity_drop_merkle_tree_block(inode, &hblocks[level - 1].block);
 	return false;
 }
 
 static bool
 verify_data_blocks(struct folio *data_folio, size_t len, size_t offset,
-		   unsigned long max_ra_pages)
+		   unsigned long max_ra_bytes)
 {
 	struct inode *inode = data_folio->mapping->host;
 	struct fsverity_info *vi = inode->i_verity_info;
@@ -262,7 +252,7 @@  verify_data_blocks(struct folio *data_folio, size_t len, size_t offset,
 
 		data = kmap_local_folio(data_folio, offset);
 		valid = verify_data_block(inode, vi, data, pos + offset,
-					  max_ra_pages);
+					  max_ra_bytes);
 		kunmap_local(data);
 		if (!valid)
 			return false;
@@ -308,7 +298,7 @@  EXPORT_SYMBOL_GPL(fsverity_verify_blocks);
 void fsverity_verify_bio(struct bio *bio)
 {
 	struct folio_iter fi;
-	unsigned long max_ra_pages = 0;
+	unsigned long max_ra_bytes = 0;
 
 	if (bio->bi_opf & REQ_RAHEAD) {
 		/*
@@ -320,12 +310,12 @@  void fsverity_verify_bio(struct bio *bio)
 		 * This improves sequential read performance, as it greatly
 		 * reduces the number of I/O requests made to the Merkle tree.
 		 */
-		max_ra_pages = bio->bi_iter.bi_size >> (PAGE_SHIFT + 2);
+		max_ra_bytes = bio->bi_iter.bi_size >> 2;
 	}
 
 	bio_for_each_folio_all(fi, bio) {
 		if (!verify_data_blocks(fi.folio, fi.length, fi.offset,
-					max_ra_pages)) {
+					max_ra_bytes)) {
 			bio->bi_status = BLK_STS_IOERR;
 			break;
 		}
@@ -362,3 +352,64 @@  void __init fsverity_init_workqueue(void)
 	if (!fsverity_read_workqueue)
 		panic("failed to allocate fsverity_read_queue");
 }
+
+/**
+ * fsverity_read_merkle_tree_block() - read Merkle tree block
+ * @inode: inode to which this Merkle tree block belongs
+ * @params: merkle tree parameters
+ * @pos: byte position within merkle tree
+ * @ra_bytes: try to read ahead this many bytes
+ * @block: block to be loaded
+ *
+ * This function loads data from a merkle tree.
+ */
+int fsverity_read_merkle_tree_block(struct inode *inode,
+				    const struct merkle_tree_params *params,
+				    u64 pos, unsigned long ra_bytes,
+				    struct fsverity_blockbuf *block)
+{
+	const struct fsverity_operations *vops = inode->i_sb->s_vop;
+	unsigned long page_idx;
+	struct page *page;
+	unsigned long index;
+	unsigned int offset_in_page;
+	int err;
+
+	block->pos = pos;
+	block->size = params->block_size;
+
+	index = pos >> params->log_blocksize;
+	page_idx = round_down(index, params->blocks_per_page);
+	offset_in_page = pos & ~PAGE_MASK;
+
+	page = vops->read_merkle_tree_page(inode, page_idx,
+			ra_bytes >> PAGE_SHIFT);
+	if (IS_ERR(page)) {
+		err = PTR_ERR(page);
+		goto bad;
+	}
+
+	block->kaddr = kmap_local_page(page) + offset_in_page;
+	block->context = page;
+	return 0;
+bad:
+	fsverity_err(inode, "Error %d reading Merkle tree block %llu", err,
+			pos);
+	return err;
+}
+
+/**
+ * fsverity_drop_merkle_tree_block() - release resources acquired by
+ * fsverity_read_merkle_tree_block
+ *
+ * @inode: inode to which this Merkle tree block belongs
+ * @block: block to be released
+ */
+void fsverity_drop_merkle_tree_block(struct inode *inode,
+				     struct fsverity_blockbuf *block)
+{
+	kunmap_local(block->kaddr);
+	put_page((struct page *)block->context);
+	block->kaddr = NULL;
+	block->context = NULL;
+}
diff --git a/include/linux/fsverity.h b/include/linux/fsverity.h
index ac58b19f23d32..05f8e89e0f470 100644
--- a/include/linux/fsverity.h
+++ b/include/linux/fsverity.h
@@ -26,6 +26,25 @@ 
 /* Arbitrary limit to bound the kmalloc() size.  Can be changed. */
 #define FS_VERITY_MAX_DESCRIPTOR_SIZE	16384
 
+/**
+ * struct fsverity_blockbuf - Merkle Tree block buffer
+ * @context: filesystem private context
+ * @kaddr: virtual address of the block's data
+ * @pos: the position of the block in the Merkle tree (in bytes)
+ * @size: the Merkle tree block size
+ *
+ * Buffer containing a single Merkle Tree block.  When fs-verity wants to read
+ * merkle data from disk, it passes the filesystem a buffer with the @pos
+ * and @size fields filled out.  The filesystem sets @kaddr and
+ * @context.
+ */
+struct fsverity_blockbuf {
+	void *context;
+	void *kaddr;
+	loff_t pos;
+	unsigned int size;
+};
+
 /* Verity operations for filesystems */
 struct fsverity_operations {