Message ID: 20221213172935.680971-1-aalbersh@redhat.com (mailing list archive)
Series: fs-verity support for XFS
On Tue, Dec 13, 2022 at 06:29:24PM +0100, Andrey Albershteyn wrote:
> Not yet implemented:
> - No pre-fetching of Merkle tree pages in the
>   read_merkle_tree_page()

This would be helpful, but not essential.

> - No marking of already verified Merkle tree pages (each read, the
>   whole tree is verified).

This is essential to have, IMO.

You *could* do what btrfs does, where it caches the Merkle tree pages
in the inode's page cache past i_size, even though btrfs stores the
Merkle tree separately from the file data on-disk.

However, I'd guess that the other XFS developers would have an
aversion to that approach, even though it would not affect the on-disk
storage. The alternatives would be to create a separate in-memory-only
inode for the cache, or to build a custom cache with its own shrinker.

- Eric
On Tue, Dec 13, 2022 at 12:50:28PM -0800, Eric Biggers wrote:
> On Tue, Dec 13, 2022 at 06:29:24PM +0100, Andrey Albershteyn wrote:
> > Not yet implemented:
> > - No pre-fetching of Merkle tree pages in the
> >   read_merkle_tree_page()
>
> This would be helpful, but not essential.
>
> > - No marking of already verified Merkle tree pages (each read, the
> >   whole tree is verified).

Ah, I wasn't aware that this was missing.

> This is essential to have, IMO.
>
> You *could* do what btrfs does, where it caches the Merkle tree pages
> in the inode's page cache past i_size, even though btrfs stores the
> Merkle tree separately from the file data on-disk.
>
> However, I'd guess that the other XFS developers would have an
> aversion to that approach, even though it would not affect the
> on-disk storage.

Yup, on an architectural level it just seems wrong to cache secure
verification metadata in the same user accessible address space as the
data it verifies.

> The alternatives would be to create a separate in-memory-only inode
> for the cache, or to build a custom cache with its own shrinker.

The Merkle tree blocks are cached in the XFS buffer cache.

Andrey, could we just add a new flag to xfs_buf->b_flags to indicate
that the buffer contains verified Merkle tree records? i.e. if the
flag is not set after we've read the buffer, we need to verify the
buffer and set the flag; if the flag is already set on a buffer found
in cache, we can skip the verification.

Cheers,

Dave.
On Wed, Dec 14, 2022 at 09:11:39AM +1100, Dave Chinner wrote:
> On Tue, Dec 13, 2022 at 12:50:28PM -0800, Eric Biggers wrote:
> > On Tue, Dec 13, 2022 at 06:29:24PM +0100, Andrey Albershteyn wrote:
> > > Not yet implemented:
> > > - No pre-fetching of Merkle tree pages in the
> > >   read_merkle_tree_page()
> >
> > This would be helpful, but not essential.
> >
> > > - No marking of already verified Merkle tree pages (each read, the
> > >   whole tree is verified).
>
> Ah, I wasn't aware that this was missing.
>
> > This is essential to have, IMO.
> >
> > You *could* do what btrfs does, where it caches the Merkle tree
> > pages in the inode's page cache past i_size, even though btrfs
> > stores the Merkle tree separately from the file data on-disk.
> >
> > However, I'd guess that the other XFS developers would have an
> > aversion to that approach, even though it would not affect the
> > on-disk storage.
>
> Yup, on an architectural level it just seems wrong to cache secure
> verification metadata in the same user accessible address space as
> the data it verifies.
>
> > The alternatives would be to create a separate in-memory-only inode
> > for the cache, or to build a custom cache with its own shrinker.
>
> The Merkle tree blocks are cached in the XFS buffer cache.
>
> Andrey, could we just add a new flag to xfs_buf->b_flags to indicate
> that the buffer contains verified Merkle tree records? i.e. if the
> flag is not set after we've read the buffer, we need to verify the
> buffer and set the flag; if the flag is already set on a buffer found
> in cache, we can skip the verification.

Well, my proposal at
https://lore.kernel.org/r/20221028224539.171818-2-ebiggers@kernel.org
is to keep tracking the "verified" status at the individual Merkle
tree block level, by adding a bitmap fsverity_info::hash_block_verified.
That is part of the fs/verity/ infrastructure, and all filesystems
would be able to use it.

However, since it's necessary to re-verify blocks that have been
evicted and then re-instantiated, my patch also repurposes PG_checked
as an indicator for whether the Merkle tree pages are newly
instantiated. For a "non-page-cache cache", that part would need to be
replaced with something equivalent.

A different approach would be to make it so that every time a page (or
"cache buffer", to call it something more generic) of N Merkle tree
blocks is read, then all N of those blocks are verified immediately.
Then there would be no need to track the "verified" status of
individual blocks. My concerns with that approach are:

* Most data reads only need a single Merkle tree block at the deepest
  level. If at least N tree blocks were verified any time that any
  were verified at all, that would make the worst-case read latency
  worse.

* It's possible that the parents of N tree blocks are split across a
  cache buffer. Thus, while N blocks can't have more than N parents,
  and in practice would just have 1-2, those 2 parents could be split
  into two separate cache buffers, with a total length of 2*N.
  Verifying all of those would really increase the worst-case latency
  as well.

So I'm thinking that tracking the "verified" status of tree blocks
individually is the right way to go. But I'd appreciate any other
thoughts on this.

- Eric
On Tue, Dec 13, 2022 at 10:31:42PM -0800, Eric Biggers wrote:
> On Wed, Dec 14, 2022 at 09:11:39AM +1100, Dave Chinner wrote:
> > On Tue, Dec 13, 2022 at 12:50:28PM -0800, Eric Biggers wrote:
> > > On Tue, Dec 13, 2022 at 06:29:24PM +0100, Andrey Albershteyn wrote:
> > > > Not yet implemented:
> > > > - No pre-fetching of Merkle tree pages in the
> > > >   read_merkle_tree_page()
> > >
> > > This would be helpful, but not essential.
> > >
> > > > - No marking of already verified Merkle tree pages (each read,
> > > >   the whole tree is verified).
> >
> > Ah, I wasn't aware that this was missing.
> >
> > > This is essential to have, IMO.
> > >
> > > You *could* do what btrfs does, where it caches the Merkle tree
> > > pages in the inode's page cache past i_size, even though btrfs
> > > stores the Merkle tree separately from the file data on-disk.
> > >
> > > However, I'd guess that the other XFS developers would have an
> > > aversion to that approach, even though it would not affect the
> > > on-disk storage.
> >
> > Yup, on an architectural level it just seems wrong to cache secure
> > verification metadata in the same user accessible address space as
> > the data it verifies.
> >
> > > The alternatives would be to create a separate in-memory-only
> > > inode for the cache, or to build a custom cache with its own
> > > shrinker.
> >
> > The Merkle tree blocks are cached in the XFS buffer cache.
> >
> > Andrey, could we just add a new flag to xfs_buf->b_flags to
> > indicate that the buffer contains verified Merkle tree records?
> > i.e. if the flag is not set after we've read the buffer, we need to
> > verify the buffer and set the flag; if the flag is already set on a
> > buffer found in cache, we can skip the verification.
>
> Well, my proposal at
> https://lore.kernel.org/r/20221028224539.171818-2-ebiggers@kernel.org
> is to keep tracking the "verified" status at the individual Merkle
> tree block level, by adding a bitmap
> fsverity_info::hash_block_verified. That is part of the fs/verity/
> infrastructure, and all filesystems would be able to use it.

Yeah, I had a look at that rewrite of the verification code last night
- I get the gist of what it is doing, but a single patch of that
complexity is largely impossible to sanely review...

Correct me if I'm wrong, but won't using a bitmap with 1 bit per
verified block cause problems with contiguous memory allocation pretty
quickly? i.e. a 64kB bitmap only tracks 512k blocks, which is only 2GB
of Merkle tree data. Hence at file sizes of 100+GB, the bitmap would
have to be kvmalloc()d to guarantee allocation will succeed.

I'm not really worried about the bitmap memory usage, just that it
handles large contiguous allocations sanely. I suspect we may
eventually need a sparse bitmap (e.g. the old btrfs bit-radix
implementation) to track verification in really large files
efficiently.

> However, since it's necessary to re-verify blocks that have been
> evicted and then re-instantiated, my patch also repurposes PG_checked
> as an indicator for whether the Merkle tree pages are newly
> instantiated. For a "non-page-cache cache", that part would need to
> be replaced with something equivalent.

Which we could get as a boolean state from the XFS buffer cache fairly
easily - did we find the buffer in cache, or was it read from disk...

> A different approach would be to make it so that every time a page
> (or "cache buffer", to call it something more generic) of N Merkle
> tree blocks is read, then all N of those blocks are verified
> immediately. Then there would be no need to track the "verified"
> status of individual blocks.

That won't work with XFS - Merkle tree blocks are not contiguous in
the attribute b-tree, so there is no efficient "sequential bulk read"
option available. The xattr structure was largely chosen because it
allows for fast, deterministic single Merkle tree block operations...

> My concerns with that approach are:
>
> * Most data reads only need a single Merkle tree block at the deepest
>   level.

Yup, see above. :)

>   If at least N tree blocks were verified any time that any were
>   verified at all, that would make the worst-case read latency worse.

*nod*

> * It's possible that the parents of N tree blocks are split across a
>   cache buffer. Thus, while N blocks can't have more than N parents,
>   and in practice would just have 1-2, those 2 parents could be split
>   into two separate cache buffers, with a total length of 2*N.
>   Verifying all of those would really increase the worst-case latency
>   as well.
>
> So I'm thinking that tracking the "verified" status of tree blocks
> individually is the right way to go. But I'd appreciate any other
> thoughts on this.

I think having the fsverity code track verified indexes itself is much
more flexible and self-contained, and the right way to go about it.

The other issue is that verify_page() assumes that it can drop the
reference to the cached object itself - the caller actually owns the
reference to the object, not the verify_page() code. Hence if we are
passing opaque buffers to verify_page() rather than page cache pages,
we need a ->drop_block method that gets called instead of put_page().
This will allow the filesystem to hold a reference to the Merkle tree
block data while the verification occurs, ensuring that it doesn't get
reclaimed by memory pressure whilst still in use...

Cheers,

Dave.
On Thu, Dec 15, 2022 at 10:06:32AM +1100, Dave Chinner wrote:
> > Well, my proposal at
> > https://lore.kernel.org/r/20221028224539.171818-2-ebiggers@kernel.org
> > is to keep tracking the "verified" status at the individual Merkle
> > tree block level, by adding a bitmap
> > fsverity_info::hash_block_verified. That is part of the fs/verity/
> > infrastructure, and all filesystems would be able to use it.
>
> Yeah, I had a look at that rewrite of the verification code last
> night - I get the gist of what it is doing, but a single patch of
> that complexity is largely impossible to sanely review...

Thanks for taking a look at it. It doesn't really lend itself to being
split up, unfortunately, but I'll see what I can do.

> Correct me if I'm wrong, but won't using a bitmap with 1 bit per
> verified block cause problems with contiguous memory allocation
> pretty quickly? i.e. a 64kB bitmap only tracks 512k blocks, which is
> only 2GB of Merkle tree data. Hence at file sizes of 100+GB, the
> bitmap would have to be kvmalloc()d to guarantee allocation will
> succeed.
>
> I'm not really worried about the bitmap memory usage, just that it
> handles large contiguous allocations sanely. I suspect we may
> eventually need a sparse bitmap (e.g. the old btrfs bit-radix
> implementation) to track verification in really large files
> efficiently.

Well, that's why my patch uses kvmalloc() to allocate the bitmap.

I did originally think it was going to have to be a sparse bitmap that
ties into the shrinker so that pages of it can be evicted. But if you
do the math, the required bitmap size is only 1 / 2^22 the size of the
file, assuming the Merkle tree uses SHA-256 and 4K blocks. So a 100MB
file only needs a 24-byte bitmap, and the bitmap for any file under
17GB fits in a 4K page.

My patch puts an arbitrary limit at a 1 MiB bitmap, which would be a
4.4TB file.

It's not ideal to say "4TB Ought To Be Enough For Anybody". But it
does feel that it's not currently worth the extra complexity and
runtime overhead of implementing a full-blown sparse bitmap with cache
eviction support, when no one currently has a use case for fsverity on
files anywhere near that large.

> The other issue is that verify_page() assumes that it can drop the
> reference to the cached object itself - the caller actually owns the
> reference to the object, not the verify_page() code. Hence if we are
> passing opaque buffers to verify_page() rather than page cache pages,
> we need a ->drop_block method that gets called instead of put_page().
> This will allow the filesystem to hold a reference to the Merkle tree
> block data while the verification occurs, ensuring that it doesn't
> get reclaimed by memory pressure whilst still in use...

Yes, probably the prototype of ->read_merkle_tree_page will need to
change too.

- Eric
On Wed, Dec 14, 2022 at 10:47:37PM -0800, Eric Biggers wrote:
> On Thu, Dec 15, 2022 at 10:06:32AM +1100, Dave Chinner wrote:
> > > Well, my proposal at
> > > https://lore.kernel.org/r/20221028224539.171818-2-ebiggers@kernel.org
> > > is to keep tracking the "verified" status at the individual
> > > Merkle tree block level, by adding a bitmap
> > > fsverity_info::hash_block_verified. That is part of the
> > > fs/verity/ infrastructure, and all filesystems would be able to
> > > use it.
> >
> > Yeah, I had a look at that rewrite of the verification code last
> > night - I get the gist of what it is doing, but a single patch of
> > that complexity is largely impossible to sanely review...
>
> Thanks for taking a look at it. It doesn't really lend itself to
> being split up, unfortunately, but I'll see what I can do.
>
> > Correct me if I'm wrong, but won't using a bitmap with 1 bit per
> > verified block cause problems with contiguous memory allocation
> > pretty quickly? i.e. a 64kB bitmap only tracks 512k blocks, which
> > is only 2GB of Merkle tree data. Hence at file sizes of 100+GB,
> > the bitmap would have to be kvmalloc()d to guarantee allocation
> > will succeed.
> >
> > I'm not really worried about the bitmap memory usage, just that it
> > handles large contiguous allocations sanely. I suspect we may
> > eventually need a sparse bitmap (e.g. the old btrfs bit-radix
> > implementation) to track verification in really large files
> > efficiently.
>
> Well, that's why my patch uses kvmalloc() to allocate the bitmap.
>
> I did originally think it was going to have to be a sparse bitmap
> that ties into the shrinker so that pages of it can be evicted. But
> if you do the math, the required bitmap size is only 1 / 2^22 the
> size of the file, assuming the Merkle tree uses SHA-256 and 4K
> blocks. So a 100MB file only needs a 24-byte bitmap, and the bitmap
> for any file under 17GB fits in a 4K page.
>
> My patch puts an arbitrary limit at a 1 MiB bitmap, which would be a
> 4.4TB file.
>
> It's not ideal to say "4TB Ought To Be Enough For Anybody". But it
> does feel that it's not currently worth the extra complexity and
> runtime overhead of implementing a full-blown sparse bitmap with
> cache eviction support, when no one currently has a use case for
> fsverity on files anywhere near that large.

I think we can live with that for the moment, but I suspect that 4TB
file size limit will become an issue sooner rather than later. What
will happen if someone tries to measure a file larger than this limit?
What's the failure mode?

Cheers,

Dave.
On Fri, Dec 16, 2022 at 07:57:37AM +1100, Dave Chinner wrote:
> > Well, that's why my patch uses kvmalloc() to allocate the bitmap.
> >
> > I did originally think it was going to have to be a sparse bitmap
> > that ties into the shrinker so that pages of it can be evicted.
> > But if you do the math, the required bitmap size is only 1 / 2^22
> > the size of the file, assuming the Merkle tree uses SHA-256 and 4K
> > blocks. So a 100MB file only needs a 24-byte bitmap, and the
> > bitmap for any file under 17GB fits in a 4K page.
> >
> > My patch puts an arbitrary limit at a 1 MiB bitmap, which would be
> > a 4.4TB file.
> >
> > It's not ideal to say "4TB Ought To Be Enough For Anybody". But it
> > does feel that it's not currently worth the extra complexity and
> > runtime overhead of implementing a full-blown sparse bitmap with
> > cache eviction support, when no one currently has a use case for
> > fsverity on files anywhere near that large.
>
> I think we can live with that for the moment, but I suspect that 4TB
> file size limit will become an issue sooner rather than later. What
> will happen if someone tries to measure a file larger than this
> limit? What's the failure mode?

FS_IOC_ENABLE_VERITY will fail with EFBIG.

- Eric