diff mbox series

[v2,01/12] fs-verity: add a documentation file

Message ID 20181101225230.88058-2-ebiggers@kernel.org (mailing list archive)
State New, archived
Headers show
Series fs-verity: read-only file-based authenticity protection | expand

Commit Message

Eric Biggers Nov. 1, 2018, 10:52 p.m. UTC
From: Eric Biggers <ebiggers@google.com>

Add a documentation file for fs-verity, covering:

- Introduction
- Use cases
- Metadata format
    - Merkle tree
    - fs-verity descriptor
    - fsveritysetup format
- Filesystem support
    - ext4
    - f2fs
- User API
    - FS_IOC_ENABLE_VERITY
    - FS_IOC_MEASURE_VERITY
- Access semantics
- In-kernel policies
    - Built-in signature verification
- Implementation details
    - I/O path design
- Userspace utility
- Tests

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 Documentation/filesystems/fsverity.rst | 583 +++++++++++++++++++++++++
 Documentation/filesystems/index.rst    |  11 +
 2 files changed, 594 insertions(+)
 create mode 100644 Documentation/filesystems/fsverity.rst

Comments

Christoph Hellwig Dec. 12, 2018, 9:14 a.m. UTC | #1
As this apparently got merged despite no proper reviews from VFS
level persons:

NAK - the ioctl format that expects the verifycation hash in the file
data data with padding after the real data is simply not acceptable,
we can't just transform the data in the file itself based on a magic
calls like this.

Also the core code should not depend on this as a storage format,
which is a rather bad idea.  In any modern file system you can
store data like this out of line in something like the attr fork
in XFS, or the attr items in btrfs.
Eric Biggers Dec. 12, 2018, 8:26 p.m. UTC | #2
Hi Christoph,

On Wed, Dec 12, 2018 at 01:14:06AM -0800, Christoph Hellwig wrote:
> As this apparently got merged despite no proper reviews from VFS
> level persons:

fs-verity has been out for review since August, and Cc'ed to all relevant
mailing lists including linux-fsdevel, linux-ext4, linux-f2fs-devel,
linux-fscrypt, linux-integrity, and linux-kernel.  There are tests,
documentation (since v2), and a userspace tool.  It's also been presented at
multiple conferences, and has been covered by LWN multiple times.  If more
people want to review it, then they should do so; there's nothing stopping them.

> 
> NAK - the ioctl format that expects the verifycation hash in the file
> data data with padding after the real data is simply not acceptable,
> we can't just transform the data in the file itself based on a magic
> calls like this.
> 

Can you elaborate on the actual problems you think the current solution has, and
exactly what solution you'd prefer instead?  Keep in mind that (1) for large
files the Merkle tree can be gigabytes long, (2) Linux doesn't have an API for
file streams, and (3) when fs-verity is combined with fscrypt, it's important
that the hashes be encrypted, so as to not leak information about the plaintext.

> Also the core code should not depend on this as a storage format,
> which is a rather bad idea.  In any modern file system you can
> store data like this out of line in something like the attr fork
> in XFS, or the attr items in btrfs.

As explained in the documentation, the core code uses the "metadata after EOF"
format for the API, but not necessarily the on-disk format.  I.e.,
FS_IOC_ENABLE_VERITY requires it, but during the ioctl the filesystem can choose
to move the metadata into a different location, such as a file stream.

We'd just need to update fsverity_read_metadata_page() and
compute_tree_depth_and_offsets() to call out to the filesystem's
fsverity_operations to read a metadata page and get the offset of the first
metadata page, respectively.  The rest of fs/verity/ will still work.  I'd be
glad to add those two fsverity_operations now, though they're not needed for
ext4 and f2fs, if it would help clarify things further.

Thanks,

- Eric
Christoph Hellwig Dec. 13, 2018, 8:22 p.m. UTC | #3
On Wed, Dec 12, 2018 at 12:26:10PM -0800, Eric Biggers wrote:
> > As this apparently got merged despite no proper reviews from VFS
> > level persons:
> 
> fs-verity has been out for review since August, and Cc'ed to all relevant
> mailing lists including linux-fsdevel, linux-ext4, linux-f2fs-devel,
> linux-fscrypt, linux-integrity, and linux-kernel.  There are tests,
> documentation (since v2), and a userspace tool.  It's also been presented at
> multiple conferences, and has been covered by LWN multiple times.  If more
> people want to review it, then they should do so; there's nothing stopping them.

But you did not got a review from someone like Al, Linus, Andrew or me,
did you?

> Can you elaborate on the actual problems you think the current solution has, and
> exactly what solution you'd prefer instead?  Keep in mind that (1) for large
> files the Merkle tree can be gigabytes long, (2) Linux doesn't have an API for
> file streams, and (3) when fs-verity is combined with fscrypt, it's important
> that the hashes be encrypted, so as to not leak information about the plaintext.

Given that you alread use an ioctl as the interface what is the problem
of passing this data through the ioctl?
Eric Biggers Dec. 14, 2018, 4:48 a.m. UTC | #4
Hi Christoph,

On Thu, Dec 13, 2018 at 12:22:49PM -0800, Christoph Hellwig wrote:
> On Wed, Dec 12, 2018 at 12:26:10PM -0800, Eric Biggers wrote:
> > > As this apparently got merged despite no proper reviews from VFS
> > > level persons:
> > 
> > fs-verity has been out for review since August, and Cc'ed to all relevant
> > mailing lists including linux-fsdevel, linux-ext4, linux-f2fs-devel,
> > linux-fscrypt, linux-integrity, and linux-kernel.  There are tests,
> > documentation (since v2), and a userspace tool.  It's also been presented at
> > multiple conferences, and has been covered by LWN multiple times.  If more
> > people want to review it, then they should do so; there's nothing stopping them.
> 
> But you did not got a review from someone like Al, Linus, Andrew or me,
> did you?

Sure, those specific people (modulo you just now) haven't responded to the
fs-verity patches yet.  But again, the patches have been out for review for
months.  Of course, we always prefer more reviews over fewer, and we strongly
encourage anyone interested to review fs-verity!  (The Documentation/ file may
be a good place to start.)  But ultimately we cannot force reviews, and as you
know kernel reviews can be very hard to come by.  Yet, people still need
fs-verity anyway; it isn't just some toy.  And we're committed to maintaining
it, similar to fscrypt.  The ext4 and f2fs maintainers are also satisfied with
the current approach to storing the verity metadata past EOF; in fact it was
even originally Ted's idea, I think.

> 
> > Can you elaborate on the actual problems you think the current solution has, and
> > exactly what solution you'd prefer instead?  Keep in mind that (1) for large
> > files the Merkle tree can be gigabytes long, (2) Linux doesn't have an API for
> > file streams, and (3) when fs-verity is combined with fscrypt, it's important
> > that the hashes be encrypted, so as to not leak information about the plaintext.
> 
> Given that you alread use an ioctl as the interface what is the problem
> of passing this data through the ioctl?

Do you mean pass the verity metadata in a buffer?  That cannot work in general,
because it may be too large to fit into memory.

Or do you mean pass it via a second file descriptor?  That could work, but it
doesn't seem better than the current approach.  It would force every filesystem
to move the metadata around, whereas currently ext4 and f2fs can simply leave it
in place.  If you meant this, are there advantages you have in mind that would
outweigh this?

We also considered generating the Merkle tree in the kernel, in which case
FS_IOC_ENABLE_VERITY would just take a small structure similar to the current
fsverity_descriptor.  But that would add extra complexity to the kernel, and
generating a Merkle tree over a large file is the type of parallelizable, CPU
intensive work that really should be done in userspace.  Also, having userspace
provide the Merkle tree allows for it to be pre-generated and distributed with
the file, e.g. provided in a package to be installed on many systems.

But please do let us know if you have any better ideas.

Thanks!

- Eric
Theodore Ts'o Dec. 14, 2018, 5:17 a.m. UTC | #5
On Thu, Dec 13, 2018 at 12:22:49PM -0800, Christoph Hellwig wrote:
> On Wed, Dec 12, 2018 at 12:26:10PM -0800, Eric Biggers wrote:
> > > As this apparently got merged despite no proper reviews from VFS
> > > level persons:
> > 
> > fs-verity has been out for review since August, and Cc'ed to all relevant
> > mailing lists including linux-fsdevel, linux-ext4, linux-f2fs-devel,
> > linux-fscrypt, linux-integrity, and linux-kernel.  There are tests,
> > documentation (since v2), and a userspace tool.  It's also been presented at
> > multiple conferences, and has been covered by LWN multiple times.  If more
> > people want to review it, then they should do so; there's nothing stopping them.
> 
> But you did not got a review from someone like Al, Linus, Andrew or me,
> did you?

I don't consider fs-verity to be part of core VFS, but rather a
library that happens to be used by ext4 and f2fs.  This is much like
fscrypt, which was originally an ext4-only thing, but the code was
always set up so it could be used by other file systems, and when f2fs
was interested in using it, we moved it to fs/crypto.  As such the
fscrypto code never got a review from Al, Andrew, or you, and when I
pushed it to Linus, he accepted the pull request.

The difference this time is that ext4 and f2fs are interested in using
common code from the beginning.

> > Can you elaborate on the actual problems you think the current solution has, and
> > exactly what solution you'd prefer instead?  Keep in mind that (1) for large
> > files the Merkle tree can be gigabytes long, (2) Linux doesn't have an API for
> > file streams, and (3) when fs-verity is combined with fscrypt, it's important
> > that the hashes be encrypted, so as to not leak information about the plaintext.
> 
> Given that you alread use an ioctl as the interface what is the problem
> of passing this data through the ioctl?

The size of the Merkle tree is roughly size/129.  So for a 100MB file
(and there can be Android APK files that bug), the Merkle tree could
be almost 800k.  That's not really a size that we would want to push
through an ioctl.

We could treat the ioctl as write-like interface, but using write(2)
seemed to make a lot more sense.  Also, the fscrypt common code
leveraged by f2fs and ext4 assume that the verity tree will be stored
after the data blocks.

Given that the semantics of a verity-protected file is that it is
immutable, you *could* store the Merkle tree in a separate file
stream, but it really doesn't buy you anything --- by definition, you
can't append to a fs-verity protected file.  Furthermore, it would
require extra complexity in the common fsverity code --- which looks
for the Merkle tree at the end of file data --- for no real benefit.

Cheers,

						- Ted

P.S.  And if you've purchased a Pixel 3 device, it's already using the
fsverity code, so it's quite well tested (and yes, we have xfstests).
Eric Biggers Dec. 14, 2018, 5:39 a.m. UTC | #6
On Fri, Dec 14, 2018 at 12:17:22AM -0500, Theodore Y. Ts'o wrote:
> Furthermore, it would require extra complexity in the common fsverity code
> --- which looks for the Merkle tree at the end of file data --- for no real
> benefit.

To clarify, while this is technically true currently, as I mentioned it's been
kept flexible enough such that a filesystem *could* store the metadata elsewhere
with only some slight changes to the common fs/verity/ code which won't break
other filesystems.  Though of course, keeping all filesystems using the
"metadata after EOF" approach does allow a couple simplifications.

- Eric
Christoph Hellwig Dec. 17, 2018, 4:49 p.m. UTC | #7
On Thu, Dec 13, 2018 at 08:48:03PM -0800, Eric Biggers wrote:
> Sure, those specific people (modulo you just now) haven't responded to the
> fs-verity patches yet.  But again, the patches have been out for review for
> months.  Of course, we always prefer more reviews over fewer, and we strongly
> encourage anyone interested to review fs-verity!  (The Documentation/ file may
> be a good place to start.)  But ultimately we cannot force reviews, and as you
> know kernel reviews can be very hard to come by.  Yet, people still need
> fs-verity anyway; it isn't just some toy.  And we're committed to maintaining
> it, similar to fscrypt.  The ext4 and f2fs maintainers are also satisfied with
> the current approach to storing the verity metadata past EOF; in fact it was
> even originally Ted's idea, I think.

But you also can't force inclusion.  And Linus just recently complained
about merging common code patches through trees for a specific fs
without proper VFS ACKs.  And that was a for a case without userspace
ABI implications, so we really need a much better review here.

Including a VFS person ACK, CC to linux-abi and man pages for the
interface.

> > Given that you alread use an ioctl as the interface what is the problem
> > of passing this data through the ioctl?
> 
> Do you mean pass the verity metadata in a buffer?  That cannot work in general,
> because it may be too large to fit into memory.

Have a pointer in the ioctl and do get_user_pages on it.
Christoph Hellwig Dec. 17, 2018, 4:52 p.m. UTC | #8
[FYI, your mail never made it to my inbox, although I found the copy
in linux-fsdevel now]

On Fri, Dec 14, 2018 at 12:17:22AM -0500, Theodore Y. Ts'o wrote:
> I don't consider fs-verity to be part of core VFS, but rather a
> library that happens to be used by ext4 and f2fs.  This is much like
> fscrypt, which was originally an ext4-only thing, but the code was
> always set up so it could be used by other file systems, and when f2fs
> was interested in using it, we moved it to fs/crypto.  As such the
> fscrypto code never got a review from Al, Andrew, or you, and when I
> pushed it to Linus, he accepted the pull request.

And as a result we are stuck with a pretty bad interface, so this is
a very good example for how to not do thing!  Just because a user
interface is only implemented by one or two file systems doesn't mean
it should skip the userspace ABI review, because we tend to generalize
them unless they are deeply specific to fs internals.

> P.S.  And if you've purchased a Pixel 3 device, it's already using the
> fsverity code, so it's quite well tested (and yes, we have xfstests).

And all kinds of other code that would never pass review, so that isn't
really a good argument unfortunately :(  Note that I would want to buy
a piece of hardware coming with google spyware preinstalled.
Eric Biggers Dec. 17, 2018, 6:32 p.m. UTC | #9
On Mon, Dec 17, 2018 at 08:49:49AM -0800, Christoph Hellwig wrote:
> 
> > > Given that you alread use an ioctl as the interface what is the problem
> > > of passing this data through the ioctl?
> > 
> > Do you mean pass the verity metadata in a buffer?  That cannot work in general,
> > because it may be too large to fit into memory.
> 
> Have a pointer in the ioctl and do get_user_pages on it.

I don't see how that helps.  The Merkle tree can still be too large to fit in
memory.  In the worst case, it might not even fit in the address space.  And I
don't see how get_user_pages() helps either over just copy_from_user(); what are
you proposing to do with the pages after getting them, exactly?

- Eric
Eric Biggers Dec. 17, 2018, 7:15 p.m. UTC | #10
Hi Christoph,

On Mon, Dec 17, 2018 at 08:52:31AM -0800, Christoph Hellwig wrote:
> [FYI, your mail never made it to my inbox, although I found the copy
> in linux-fsdevel now]
> 
> On Fri, Dec 14, 2018 at 12:17:22AM -0500, Theodore Y. Ts'o wrote:
> > I don't consider fs-verity to be part of core VFS, but rather a
> > library that happens to be used by ext4 and f2fs.  This is much like
> > fscrypt, which was originally an ext4-only thing, but the code was
> > always set up so it could be used by other file systems, and when f2fs
> > was interested in using it, we moved it to fs/crypto.  As such the
> > fscrypto code never got a review from Al, Andrew, or you, and when I
> > pushed it to Linus, he accepted the pull request.
> 
> And as a result we are stuck with a pretty bad interface, so this is
> a very good example for how to not do thing!  Just because a user
> interface is only implemented by one or two file systems doesn't mean
> it should skip the userspace ABI review, because we tend to generalize
> them unless they are deeply specific to fs internals.
> 

While I do have some improvements planned for the fscrypt interface,
specifically how encryption keys are managed [1], the issues are subtle enough
that I don't think there's any chance they could have been gotten "right" the
first time around, even if lots more people had reviewed it.  It took me over a
year working with fscrypt to put together my proposal for how to improve things,
and it was only really possible because I was able to consider all the people
actually using fscrypt and what problems they are having, if any.

Even so, the current fscrypt interface is actually good enough that there still
hasn't been much real interest in getting my proposed improvements merged yet.
(Not surprisingly, they've also been completely ignored by all the "VFS people"
you say should be reviewing this stuff...)

So for fscrypt I personally don't think that waiting would have changed much in
practice, besides ensuring that users wouldn't have any solution at all.

[1] https://lwn.net/Articles/737274/

- Eric
Darrick J. Wong Dec. 17, 2018, 8 p.m. UTC | #11
On Thu, Dec 13, 2018 at 08:48:03PM -0800, Eric Biggers wrote:
> Hi Christoph,
> 
> On Thu, Dec 13, 2018 at 12:22:49PM -0800, Christoph Hellwig wrote:
> > On Wed, Dec 12, 2018 at 12:26:10PM -0800, Eric Biggers wrote:
> > > > As this apparently got merged despite no proper reviews from VFS
> > > > level persons:
> > > 
> > > fs-verity has been out for review since August, and Cc'ed to all relevant
> > > mailing lists including linux-fsdevel, linux-ext4, linux-f2fs-devel,
> > > linux-fscrypt, linux-integrity, and linux-kernel.  There are tests,
> > > documentation (since v2), and a userspace tool.  It's also been presented at
> > > multiple conferences, and has been covered by LWN multiple times.  If more
> > > people want to review it, then they should do so; there's nothing stopping them.
> > 
> > But you did not got a review from someone like Al, Linus, Andrew or me,
> > did you?
> 
> Sure, those specific people (modulo you just now) haven't responded to the
> fs-verity patches yet.  But again, the patches have been out for review for
> months.  Of course, we always prefer more reviews over fewer, and we strongly
> encourage anyone interested to review fs-verity!  (The Documentation/ file may
> be a good place to start.)  But ultimately we cannot force reviews, and as you
> know kernel reviews can be very hard to come by.  Yet, people still need
> fs-verity anyway; it isn't just some toy.  And we're committed to maintaining
> it, similar to fscrypt.  The ext4 and f2fs maintainers are also satisfied with
> the current approach to storing the verity metadata past EOF; in fact it was
> even originally Ted's idea, I think.
> 
> > 
> > > Can you elaborate on the actual problems you think the current solution has, and
> > > exactly what solution you'd prefer instead?  Keep in mind that (1) for large
> > > files the Merkle tree can be gigabytes long, (2) Linux doesn't have an API for
> > > file streams, and (3) when fs-verity is combined with fscrypt, it's important
> > > that the hashes be encrypted, so as to not leak information about the plaintext.
> > 
> > Given that you alread use an ioctl as the interface what is the problem
> > of passing this data through the ioctl?
> 
> Do you mean pass the verity metadata in a buffer?  That cannot work in general,
> because it may be too large to fit into memory.
> 
> Or do you mean pass it via a second file descriptor?  That could work, but it
> doesn't seem better than the current approach.  It would force every filesystem
> to move the metadata around, whereas currently ext4 and f2fs can simply leave it
> in place.  If you meant this, are there advantages you have in mind that would
> outweigh this?

FWIW, if I were (hypothetically) working on an xfs implementation, I
likely would have settled on passing a reference to a merkle tree
through a (fd, length) pair, because that allows us plenty of options
on the back end:

b) we could remap the tree into a new inode fork for merkle trees, or
a) remap it as posteof blocks like ext4/f2fs does, or
c) remap the blocks into the attribute fork as an (unusually large)
extended attribute value.

If the merkle_fd isn't on the same filesystem as the fd we could at
least use generic_copy_file_range (i.e. page cache copying) to land the
merkle tree wherever we want.

Granted, it's not like we can't do any of those three things given the
current interface.  I gather most of the grumbling has to do with
feeling like we're associating the on-disk format to the ioctl interface
too closely?

I certainly can see why you'd want to avoid having to run a whole bunch
of SWAPEXT operations to set up a verity file, though.

Anyhow, that's just my 2 cents.  :)

--D

> We also considered generating the Merkle tree in the kernel, in which case
> FS_IOC_ENABLE_VERITY would just take a small structure similar to the current
> fsverity_descriptor.  But that would add extra complexity to the kernel, and
> generating a Merkle tree over a large file is the type of parallelizable, CPU
> intensive work that really should be done in userspace.  Also, having userspace
> provide the Merkle tree allows for it to be pre-generated and distributed with
> the file, e.g. provided in a package to be installed on many systems.
> 
> But please do let us know if you have any better ideas.
> 
> Thanks!
> 
> - Eric
Theodore Ts'o Dec. 19, 2018, 12:16 a.m. UTC | #12
On Mon, Dec 17, 2018 at 12:00:39PM -0800, Darrick J. Wong wrote:
> FWIW, if I were (hypothetically) working on an xfs implementation, I
> likely would have settled on passing a reference to a merkle tree
> through a (fd, length) pair, because that allows us plenty of options
> on the back end:
> 
> b) we could remap the tree into a new inode fork for merkle trees, or
> a) remap it as posteof blocks like ext4/f2fs does, or
> c) remap the blocks into the attribute fork as an (unusually large)
> extended attribute value.

Sure, but what would be the benefit of doing different things on the
back end?  I think this is a really more of a philophical objection
than anything else.  With both fsverity and fscrypt, well over 95% of
the implementation is shared between ext4 and f2fs.  And from a
cryptographic design, that's something I consider a feature, not a
bug.  Cryptographic code is subtle in very different ways compared to
file system code.  So it's a good thing to having it done once and
audited by crypto specialists, as opposed to having each file system
doing it differently / independently.

> If the merkle_fd isn't on the same filesystem as the fd we could at
> least use generic_copy_file_range (i.e. page cache copying) to land the
> merkle tree wherever we want.
> 
> Granted, it's not like we can't do any of those three things given the
> current interface.  I gather most of the grumbling has to do with
> feeling like we're associating the on-disk format to the ioctl interface
> too closely?

Right, the current interface makes it somewhat more awkward to do
these other things --- but the question is *why* would you want to in
the first place?  Why add the extra complexity?  I'm a big believer of
the KISS principle, and if there was a reason why a file system would
want to store the Merkle tree somewhere else, we could talk about it,
but I see only downside, and no upside.

      	       		     	- Ted
Dave Chinner Dec. 19, 2018, 2:19 a.m. UTC | #13
On Tue, Dec 18, 2018 at 07:16:03PM -0500, Theodore Y. Ts'o wrote:
> On Mon, Dec 17, 2018 at 12:00:39PM -0800, Darrick J. Wong wrote:
> > FWIW, if I were (hypothetically) working on an xfs implementation, I
> > likely would have settled on passing a reference to a merkle tree
> > through a (fd, length) pair, because that allows us plenty of options
> > on the back end:
> > 
> > b) we could remap the tree into a new inode fork for merkle trees, or
> > a) remap it as posteof blocks like ext4/f2fs does, or
> > c) remap the blocks into the attribute fork as an (unusually large)
> > extended attribute value.
> 
> Sure, but what would be the benefit of doing different things on the
> back end?  I think this is a really more of a philophical objection
> than anything else.

Putting metadata in user files beyond EOF doesn't work with XFS's
post-EOF speculative allocation algorithms.

i.e. Filesystem design/algorithms often assume that the region
beyond EOF in user files is a write-only region.  e.g. We can allow
extents beyond EOF to be uninitialised because they are in a write
only region of the file and so there's no possibility of stale data
exposure. Unfortunately, putting filesystem/security metadata beyond
EOF breaks these assumptions - it's no longer a write-only region.

IOWs, all these existing assumptions and implementation details are
exposed to a new attack surface involving tricking the filesystem
into thinking it has readable data beyond EOF. And because it can
now read from the "write only" region beyond EOF (because that's the
mechanism by which fsverity does it's verification) we no longer
have a clear line of protection against exposing such data to
userspace.

Putting the merkel tree somewhere else in the filesystem metadata
and providing a separate API to manipulate it avoids this problem.
It allows filesystems to keep their internal metadata and
security-related verification information in a separate channel (and
trust path) that is completely out of user data/access scope.

Cheers,

Dave.
Christoph Hellwig Dec. 19, 2018, 7:09 a.m. UTC | #14
On Mon, Dec 17, 2018 at 10:32:06AM -0800, Eric Biggers wrote:
> I don't see how that helps.  The Merkle tree can still be too large to fit in
> memory.  In the worst case, it might not even fit in the address space.  And I
> don't see how get_user_pages() helps either over just copy_from_user(); what are
> you proposing to do with the pages after getting them, exactly?

Write them out to a file system specific area on the media.  Note
that get_user_pages is indeed not going to work if you run out of
address space, but that seems like an odd use case.  Out of of memory
is not an issue as we generally iterate over a small number of pages
for each individual get_user_pages call.
Christoph Hellwig Dec. 19, 2018, 7:11 a.m. UTC | #15
On Mon, Dec 17, 2018 at 12:00:39PM -0800, Darrick J. Wong wrote:
> FWIW, if I were (hypothetically) working on an xfs implementation, I
> likely would have settled on passing a reference to a merkle tree
> through a (fd, length) pair, because that allows us plenty of options
> on the back end:
> 
> b) we could remap the tree into a new inode fork for merkle trees, or
> a) remap it as posteof blocks like ext4/f2fs does, or
> c) remap the blocks into the attribute fork as an (unusually large)
> extended attribute value.
>
> If the merkle_fd isn't on the same filesystem as the fd we could at
> least use generic_copy_file_range (i.e. page cache copying) to land the
> merkle tree wherever we want.

I think the fd would have to be on the same fs for this interface to
make sense. But it could be an O_TMPFILE one.  And given that ext4
already supports a variant of swapext this interface should also work
with the existing ext4 on disk format.
Christoph Hellwig Dec. 19, 2018, 7:14 a.m. UTC | #16
On Tue, Dec 18, 2018 at 07:16:03PM -0500, Theodore Y. Ts'o wrote:
> Sure, but what would be the benefit of doing different things on the
> back end?  I think this is a really more of a philophical objection
> than anything else.  With both fsverity and fscrypt, well over 95% of
> the implementation is shared between ext4 and f2fs.  And from a
> cryptographic design, that's something I consider a feature, not a
> bug.  Cryptographic code is subtle in very different ways compared to
> file system code.  So it's a good thing to having it done once and
> audited by crypto specialists, as opposed to having each file system
> doing it differently / independently.

Where the data is located on disk should not matter for the crypto
details.  If it does you have severe implementation issues.

> Right, the current interface makes it somewhat more awkward to do
> these other things --- but the question is *why* would you want to in
> the first place?  Why add the extra complexity?  I'm a big believer of
> the KISS principle, and if there was a reason why a file system would
> want to store the Merkle tree somewhere else, we could talk about it,
> but I see only downside, and no upside.

Filesystems already use blocks beyond EOF for preallocation, either
speculative by the file system itself, or explicitly by the user with
fallocate.  I bet you will run into bugs with your creative abuse
sooner or later.  Indepnd of that the interface simply is gross, which
is enough of a reason not to merge it.
Christoph Hellwig Dec. 19, 2018, 7:19 a.m. UTC | #17
On Tue, Dec 18, 2018 at 11:16:08PM -0800, Linus Torvalds wrote:
> On Tue, Dec 18, 2018, 23:11 Christoph Hellwig <hch@infradead.org wrote:
> 
> >
> > I think the fd would have to be on the same fs for this interface to
> > make sense. But it could be an O_TMPFILE one.  And given that ext4
> > already supports a variant of swapext this interface should also work
> > with the existing ext4 on disk format.
> >
> 
> Why is the merkle tree not just an xattr of the file?

Because it is too large for our awkward xattrs interface..
Dave Chinner Dec. 19, 2018, 9:35 p.m. UTC | #18
On Wed, Dec 19, 2018 at 02:30:05PM -0500, Theodore Y. Ts'o wrote:
> On Wed, Dec 19, 2018 at 01:19:53PM +1100, Dave Chinner wrote:
> > Putting metadata in user files beyond EOF doesn't work with XFS's
> > post-EOF speculative allocation algorithms.
> > 
> > i.e. Filesystem design/algorithms often assume that the region
> > beyond EOF in user files is a write-only region.  e.g. We can allow
> > extents beyond EOF to be uninitialised because they are in a write
> > only region of the file and so there's no possibility of stale data
> > exposure. Unfortunately, putting filesystem/security metadata beyond
> > EOF breaks these assumptions - it's no longer a write-only region.
> 
> On Tue, Dec 18, 2018 at 11:14:20PM -0800, Christoph Hellwig wrote:
> > Filesystems already use blocks beyond EOF for preallocation, either
> > speculative by the file system itself, or explicitly by the user with
> > fallocate.  I bet you will run into bugs with your creative abuse
> > sooner or later.  Indepnd of that the interface simply is gross, which
> > is enough of a reason not to merge it.
> 
> Both of these concerns aren't applicable for fs-verity because the
> entire file will be read-only.  So there will be no preallocation or
> fallocation going on --- or allowed --- for a file which is protected
> by fs-verity.  Since no writes are allowed at all, it won't break any
> file systems' assumptions about "write-only regions".

The file has to be written before it has been protected, which means
it may very well have user space allocated beyond EOF before the
merkle tree needs to be written. And, well, the fact that it is all
read only after creation is a feature implementation detail that
allows fsverity to "get away with" storing it's metadata in file
data space.

But whether or not fsverity is enabled on the filesystem, the fact
is that the kernel code now has to support storing and reading data
from beyond EOF. Every user, whether they are using fsverity or not,
is now exposed to that code and a filesystem that no longer
considers the user data region beyond EOF as write only. i.e. it
doesn't matter if fsverity is in use, then ext4/f2fs code now
allows reading of information beyond EOF from user data files

i.e. you've completely changed the way files appear to /everyone/,
not just the users of fsverity. Not only that, you now have file
data that has a specific metadata on-disk format encoded into file
data space. That greatly complicates filesystem checking and
scrubbing which typically /doesn't even look at the contents of
user data/. So yeah, this hack might make the merkle tree
verification "simple" but it greatly complicates everything else the
filesystem has to do.

That's the problem here - fsverity completely redefines the layout
of user data files for everyone, not just fsverity, and not just the
filesystems that implement fsverity. You've taken an ext4 fsverity
implementation feature and promoted it to being a linux-wide 
file data layout standard that is encoded into the kernel/user
ABI/API forever more.

And you're trying to force this into the tree on everyone without
adequate review because "a product is already shipping with this
code in it". Didn't we learn the lessons of failing to "upstream
first" new features more than 15 years ago?

> As far as whether it's "gross" --- that's a taste question, and I
> happen to think it's more "clever" than "gross". 

You think it's clever because it's a neat hack that makes what you
need simple to implement, and so you can ship it in phones quickly
and without needing to involve upstream in more complex design
discussions.

I think it's gross because it bleeds implementation details all over
the API and globally redefines the user data file layout for
everyone, kernel wide in a manner that is incompatible with existing
filesystem implementations.

> It allows for a very
> simple implementation, *leveraging* the fact that the file will never
> change --- and especially, grow in length.  So why not use the space
> after EOF?

There have been many technical reasons given for why it's a bad interfaces,
yet you only address entirely subjective arguments and claim that
you have "good taste" in APIs because it is "clever".

> The alternative requires adding Solaris-style alternate data streams
> support.

No, it does not. It simply requires a different userspace API to
move the merkle tree data into the kernel, and a different
implemetnation abstraction that allows filesystems to provide the
merkle tree data pages on request. Darrick and Christoph have
already suggested alternative user APIs that would work just fine,
and they don't ahve a requirement that the merkle tree is held in
the user data space beyond EOF.

How filesystems store and retrieve merkle tree data should be a
filesystem internal detail. If how metadata is stored in th e
filesystem is defined by the userspace API or the kernel library
code that implements the verification feature, then it lacks the
necessary abstraction to be a generic Linux filesystem feature.
IOWs, it needs to be redesigned and reworked before we should
consider it for merging.

Cheers,

Dave.
Theodore Ts'o Dec. 20, 2018, 10:01 p.m. UTC | #19
On Thu, Dec 20, 2018 at 08:35:52AM +1100, Dave Chinner wrote:
> 
> The file has to be written before it has been protected, which means
> it may very well have user space allocated beyond EOF before the
> merkle tree needs to be written.

Sure, and every file system knows how to truncate a file.  This isn't
hard.

> But whether or not fsverity is enabled on the filesystem, the fact
> is that the kernel code now has to support storing and reading data
> from beyond EOF. Every user, whether they are using fsverity or not,
> is now exposed to that code and a filesystem that no longer
> considers the user data region beyond EOF as write only.

That's simply not true.  Number one, fsverity is not mandatory for all
file systems to implement.  If XFS doesn't want to implement fscrypt
or fsverity, it doesn't have to.  Number two, we're not *making* any
changes to the kernel code; nothing in mm/filemap.c, et. al.  So
saying that we are making changes that are impacted by /everyone/ just
doesn't make any sense.

> How filesystems store and retrieve merkle tree data should be a
> filesystem internal detail. If how metadata is stored in th e
> filesystem is defined by the userspace API or the kernel library
> code that implements the verification feature, then it lacks the
> necessary abstraction to be a generic Linux filesystem feature.
> IOWs, it needs to be redesigned and reworked before we should
> consider it for merging.

I disagree with your aesthetics that the interface has to be
completely isolated from the implementation.  If you don't want to
call it a generic file system feature, fine.  It can just be something
that f2fs and ext4 uses.

					- Ted
Christoph Hellwig Dec. 21, 2018, 7:04 a.m. UTC | #20
On Thu, Dec 20, 2018 at 05:01:58PM -0500, Theodore Y. Ts'o wrote:
> That's simply not true.  Number one, fsverity is not mandatory for all
> file systems to implement.  If XFS doesn't want to implement fscrypt
> or fsverity, it doesn't have to.  Number two, we're not *making* any
> changes to the kernel code; nothing in mm/filemap.c, et. al.  So
> saying that we are making changes that are impacted by /everyone/ just
> doesn't make any sense.

Ted, I think you know yourself this isn't true.  Whenever we added
useful interface to one of the major file systems we had other pick
it up, and that is a good thing because the last thing we need is
fragmentation of interfaces.  And even if that wasn't the case I don't
think we should take short cuts, because even if an interface was just
for a file system or two it still needs to be properly desgined.

There is no reason to rush interfacs in, because everytime we have done
that it has turned out to be a very bad idea in retrospective.
Richard Weinberger Dec. 21, 2018, 10:06 a.m. UTC | #21
On Fri, Dec 21, 2018 at 9:58 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Dec 20, 2018 at 05:01:58PM -0500, Theodore Y. Ts'o wrote:
> > That's simply not true.  Number one, fsverity is not mandatory for all
> > file systems to implement.  If XFS doesn't want to implement fscrypt
> > or fsverity, it doesn't have to.  Number two, we're not *making* any
> > changes to the kernel code; nothing in mm/filemap.c, et. al.  So
> > saying that we are making changes that are impacted by /everyone/ just
> > doesn't make any sense.
>
> Ted, I think you know yourself this isn't true.  Whenever we added
> useful interface to one of the major file systems we had other pick
> it up, and that is a good thing because the last thing we need is
> fragmentation of interfaces.  And even if that wasn't the case I don't
> think we should take short cuts, because even if an interface was just
> for a file system or two it still needs to be properly desgined.
>
> There is no reason to rush interfacs in, because everytime we have done
> that it has turned out to be a very bad idea in retrospective.

Speaking of interfaces, one thing that needs IMHO more thought is the
user facing interface. Not only in the fsverity case, in all cases.

Linux has currently many different ways to implement and check
(cryptographic)-integrity.
Just to name a few, fsverity, UBIFS' auth feature, BTRFS csum,
EVM/IMA, dm-integrity,
dm-verity, ...

At least for filesystems it would be good to have a common interface
to query the
integrity status of a file.
So far we have mixture of different return codes (EPERM, EUCLEAN, EINVAL),
audit events plus cryptic kernel logs.

In UBIFS we return EPERM in case of an auth failure, we followed EVM/IMA.
fsverity goes the EUCLEAN route, just like BTRFS for check sum failures.

Before everything is set in stone, let's try to consolidate at least this.

Along with that, what about the statx() interface?
We have already STATX_ATTR_ENCRYPTED, why no STATX_ATTR_AUTHENTICATED?
Theodore Ts'o Dec. 21, 2018, 3:47 p.m. UTC | #22
On Thu, Dec 20, 2018 at 11:04:47PM -0800, Christoph Hellwig wrote:
> Ted, I think you know yourself this isn't true.  Whenever we added
> useful interface to one of the major file systems we had other pick
> it up, and that is a good thing because the last thing we need is
> fragmentation of interfaces.  And even if that wasn't the case I don't
> think we should take short cuts, because even if an interface was just
> for a file system or two it still needs to be properly desgined.

This is why I think the interface argument is totally bogus.

If you're OK with Darrick's suggested interface, where you pass in a
file descriptor, offset and length --- that's just a superset of the
current interface, except where the file descriptor is in the file
which is going to be protected using fs-verity.  So there's if you're
OK with that interface, we can add that interface later, and it's
really no big deal; it certainly doesn't add any extra complexity for
XFS --- assuming that XFS even gets around to adding support for
fs-verity.

Adding that extra complexity is not necessary for the current users of
the interface, and as I've said multiple times before, there's no
*value* in allowing the Merkle tree to be passed in via some arbitrary
file descriptor, which might even be on a separate fhile system, as
opposed in the file which is about to be protected using fs-verity.

Linus --- we're going round and round, and I don't think this is
really a technical dispute at this point, but rather an aesthetics
one.  Will you be willing to accept my pull request for a feature
which is being shippped on millions of Android phones, has been out
for review for months, and for which, if we *really* need to add
uselessly complicated interface later, we can do that?  It's always
been the case for internal Kernel interfaces not to add code "just in
case" it's useful, but rather when a user turns up.  I argue we should
be doing the same thing for user-space visible interfaces.

Regards,

						- Ted
Matthew Wilcox Dec. 21, 2018, 3:53 p.m. UTC | #23
On Fri, Dec 21, 2018 at 10:47:14AM -0500, Theodore Y. Ts'o wrote:
> Linus --- we're going round and round, and I don't think this is
> really a technical dispute at this point, but rather an aesthetics
> one.  Will you be willing to accept my pull request for a feature
> which is being shippped on millions of Android phones, has been out
> for review for months, and for which, if we *really* need to add
> uselessly complicated interface later, we can do that?  It's always
> been the case for internal Kernel interfaces not to add code "just in
> case" it's useful, but rather when a user turns up.  I argue we should
> be doing the same thing for user-space visible interfaces.

To look at it another way, this is an aesthetic dispute in which all those
who have offered opinions from outside Google -- myself, Dave Chinner &
Christoph all really dislike this interface.  I'd be happy to discuss
alternative interfaces, particularly ones which allow for the current
internal implementation, but I think this interface is really bad.

In contrast to "we'll just fix it up later" (which usually applies
to in-kernel interfaces), we have a policy of not breaking userspace,
so accepting this interface means setting it in stone.  We should get
it right.
Matthew Wilcox Dec. 21, 2018, 4:11 p.m. UTC | #24
On Thu, Nov 01, 2018 at 03:52:19PM -0700, Eric Biggers wrote:
> +In the recommended configuration of SHA-256 and 4K blocks, 128 hash
> +values fit in each block.  Thus, each level of the hash tree is 128
> +times smaller than the previous, and for large files the Merkle tree's
> +size converges to approximately 1/129 of the original file size.

I think you mean 1/127, not 1/129.

> +fsveritysetup format
> +--------------------
> +
> +When enabling fs-verity on a file via the `FS_IOC_ENABLE_VERITY`_
> +ioctl, the kernel requires that the verity metadata has been appended
> +to the file contents.  Specifically, the file must be arranged as:
> +
> +#. Original file contents
> +#. Zero-padding to next block boundary
> +#. `Merkle tree`_
> +#. `fs-verity descriptor`_
> +#. fs-verity footer
> +
> +We call this file format the "fsveritysetup format".  It is not
> +necessarily the on-disk format actually used by the filesystem, since
> +the filesystem is free to move things around during the ioctl.
> +However, the easiest way to implement fs-verity is to just keep this
> +arrangement in-place, as ext4 and f2fs do; see `Filesystem support`_.
> +
> +Note that "block" here means the fs-verity block size, which is not
> +necessarily the same as the filesystem's block size.  For example, on
> +ext4, fs-verity can use 4K blocks on top of a filesystem formatted to
> +use a 1K block size.
> +
> +The fs-verity footer is a structure of the following format::
> +
> +    struct fsverity_footer {
> +            __le32 desc_reverse_offset;
> +            __u8 magic[8];
> +    };
> +
> +``desc_reverse_offset`` is the distance in bytes from the end of the
> +fs-verity footer to the beginning of the fs-verity descriptor; this
> +allows software to find the fs-verity descriptor.  ``magic`` is the
> +ASCII bytes "FSVerity"; this allows software to quickly identify a
> +file as being in the "fsveritysetup" format as well as find the
> +fs-verity footer if zeroes have been appended.
> +
> +The kernel cannot handle fs-verity footers that cross a page boundary.
> +Padding must be prepended as needed to meet this constaint.

I think this ioctl is the start of the disagreement.  How about this
strawman:

verity_fd = ioctl(fd, FS_IOC_VERITY_FD);
write(verity_fd, &merkle_tree);
close(verity_fd);

At final close of that verity_fd, the filesystem behaves in the same way
that it does on receipt of this FS_IOC_ENABLE_VERITY ioctl today.

> +FS_IOC_MEASURE_VERITY
> +---------------------
> +
> +The FS_IOC_MEASURE_VERITY ioctl retrieves the fs-verity measurement of
> +a regular file.  This is a digest that cryptographically summarizes
> +the file contents that are being enforced on reads.  The file must
> +have fs-verity enabled.
> +
> +This ioctl takes in a pointer to a variable-length structure::
> +
> +    struct fsverity_digest {
> +            __u16 digest_algorithm;
> +            __u16 digest_size; /* input/output */
> +            __u8 digest[];
> +    };
> +
> +``digest_size`` is an input/output field.  On input, it must be
> +initialized to the number of bytes allocated for the variable-length
> +``digest`` field.
> +
> +On success, 0 is returned and the kernel fills in the structure as
> +follows:
> +
> +- ``digest_algorithm`` will be the hash algorithm used for the file
> +  measurement.  It will match the algorithm used in the Merkle tree,
> +  e.g. FS_VERITY_ALG_SHA256.  See ``include/uapi/linux/fsverity.h``
> +  for the list of possible values.
> +- ``digest_size`` will be the size of the digest in bytes, e.g. 32
> +  for SHA-256.  (This can be redundant with ``digest_algorithm``.)
> +- ``digest`` will be the actual bytes of the digest.
> +
> +This ioctl is guaranteed to be very fast.  Due to fs-verity's use of a
> +Merkle tree, its running time is independent of the file size.
> +
> +This ioctl can fail with the following errors:
> +
> +- ``EFAULT``: invalid buffer was specified
> +- ``ENODATA``: the file is not a verity file
> +- ``ENOTTY``: this type of filesystem does not implement fs-verity
> +- ``EOPNOTSUPP``: the kernel was not configured with fs-verity support
> +  for this filesystem, or the filesystem superblock has not had the
> +  'verity' feature enabled on it.  (See `Filesystem support`_.)
> +- ``EOVERFLOW``: the file measurement is longer than the specified
> +  ``digest_size`` bytes.  Try providing a larger buffer.

Should this ioctl be better implemented as an xattr?

> +- Direct I/O is not supported on verity files.  Attempts to use direct
> +  I/O on such files will fall back to buffered I/O.

That makes sense; the filesystem can't verify the data before presenting
it to userspace if it's being copied directly into userspace.

> +- DAX (Direct Access) is not supported on verity files.

That makes less sense.  The kernel can check the checksum before
copying the data to the user.  Is this simply a current limitation of
the implementation?

> +Thus, when ascending the tree reading hash pages, fs-verity can stop
> +as soon as it finds an already-checked hash page.  This optimization,
> +which is also used by dm-verity, results in excellent sequential read
> +performance since usually the deepest needed hash page will already be
> +cached and checked.  However, random reads perform worse.

I think you mean "all but the deepest"?
Theodore Ts'o Dec. 21, 2018, 4:28 p.m. UTC | #25
On Fri, Dec 21, 2018 at 07:53:54AM -0800, Matthew Wilcox wrote:
> In contrast to "we'll just fix it up later" (which usually applies
> to in-kernel interfaces), we have a policy of not breaking userspace,
> so accepting this interface means setting it in stone.  We should get
> it right.

I'm not convinced it's a "fix", but my point is that if later on you
want to add extra complexity transforming

     ioctl(fd, FS_IOC_ENABLE_VERITY);

so it does the equivalent of

     ioctl(fd, FS_IOC_ENABLE_VERITY_NOW_WITH_EXTRA_USELESS_COMPLEXITY,
	   fd, sizeof_data, sizeof_verity_data);

it adds essentially no complexity to provide this backwards
compatibility.  But if we need to implement
FS_IOC_ENABLE_VERITY_NOW_WITH_EXTRA_USELESS_COMPLEXITY *now*, we gain
nothing, other than pushing back when fsverity lands upstream.  We'd
have to provide that backwards compatibility interface anyway, since
there are a lot of users for that existing interface.

So why?

				- Ted
Matthew Wilcox Dec. 21, 2018, 4:34 p.m. UTC | #26
On Fri, Dec 21, 2018 at 11:28:13AM -0500, Theodore Y. Ts'o wrote:
> On Fri, Dec 21, 2018 at 07:53:54AM -0800, Matthew Wilcox wrote:
> > In contrast to "we'll just fix it up later" (which usually applies
> > to in-kernel interfaces), we have a policy of not breaking userspace,
> > so accepting this interface means setting it in stone.  We should get
> > it right.
> 
> I'm not convinced it's a "fix", but my point is that if later on you
> want to add extra complexity transforming
> 
>      ioctl(fd, FS_IOC_ENABLE_VERITY);
> 
> so it does the equivalent of
> 
>      ioctl(fd, FS_IOC_ENABLE_VERITY_NOW_WITH_EXTRA_USELESS_COMPLEXITY,
> 	   fd, sizeof_data, sizeof_verity_data);

I disagree with your EXTRA_USELESS_COMPLEXITY appendage.  The interface
you designed reflects the implementation you did in ext4, so I understand
why it seems simple from your point of view.  From the user point of view,
it looks completely weird.  You write a file, being a series of bytes,
then all of a sudden have to know that it's composed of blocks, seek
to the next block, write a header, then this Merkle data structure,
then write a footer which isn't allowed to cross a block boundary
for some unknowable reason.  It seems much more logical to have the
header+Merkle+footer as a separate data stream which the filesystem can
then layout according to its own rules.
Linus Torvalds Dec. 21, 2018, 7:13 p.m. UTC | #27
On Fri, Dec 21, 2018 at 7:47 AM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> Linus --- we're going round and round, and I don't think this is
> really a technical dispute at this point, but rather an aesthetics
> one.

Grr.

So honestly, I personally *like* the model of "the file contains its
own validation data" model. I think that's the right model, so that
you can then basically just do "enable verification on this file, and
verify that the root hash is this".

So that part I like. I think the people who argue for "let's have a
separate interface that writes the merkle tree data" are completely
wrong.

HOWEVER.

I do agree that your particular model is pretty damn broken in lots of ways.

Why is it filesystem specific? If the whole point is that the file
itself has its own verification data (which I like), then I don't see
why this is then documented as some filesystem-specific layout model.
That's complete and utter garbage.

In other words: either the model is that the file *itself* contains
its own merkle tree that validates the file, or it isn't. You can't
have it two ways. No silly "layout changes when you apply the hash"
garbage. That's just crazy talk and invalidates the whole model.

And honestly, I still think that it's very odd to add the merge data
to the end, when the filesystem already supports xattrs. It would have
made much more sense to just make one xattr contain the merkle tree
validation data.

So why is this sold as some unholy mess of "filesystem-specific" and
"generic"? That part just annoys the hell out of me. Why isn't this
sold as an *actual* generic model, where you just say "append the
merkle tree to the file, then enable verity testing of the end result
and validate the top-level hash".

That kind of thing could be done with absolutely _zero_ per-filesystem
code, and made 100% generic, and we'd just verify the merge data in
readpages().

So what's the excuse for doing the crazy odd "let's just support one
single filesystem" model?

                    Linus
Theodore Ts'o Dec. 22, 2018, 4:17 a.m. UTC | #28
On Fri, Dec 21, 2018 at 11:13:07AM -0800, Linus Torvalds wrote:
> 
> I do agree that your particular model is pretty damn broken in lots of ways.
> 
> Why is it filesystem specific? If the whole point is that the file
> itself has its own verification data (which I like), then I don't see
> why this is then documented as some filesystem-specific layout model.
> That's complete and utter garbage.
> 
> In other words: either the model is that the file *itself* contains
> its own merkle tree that validates the file, or it isn't. You can't
> have it two ways. No silly "layout changes when you apply the hash"
> garbage. That's just crazy talk and invalidates the whole model.

Userspace applications which are reading the file aren't going to be
expecting Merkle tree.  For example, one of the use cases is Android
APK files, which are essentially ZIP files.  ZIP files can be parsed
both from the front-end (streaming), or by looking for the complete
directory of all of the files in the ZIP file by starting at the end
of the file and moving backwards.  If the Merkle tree was visible to
userspace programs that are opening and reading the file, it would
confuse them mightily.

So what we do for ext4 and f2fs is make the Merkle tree invisible; if
userspace stats the file, st_size will return size of the original
"data" file, and reading beyond the st_size from userspace will behave
like reading beyond EOF.  From the *file system's* perspective,
though, the metadata blocks are part of the file.  There's just a
difference between the userspace visible EOF and the file system's
conception of EOF.  I don't consider this a "layout change", and I
personally believe this should be just *fine* for all file systems.
The XFS developers are convinced that this is horrific, and no one
sane should do this.  OK, call me insane.  But it works, and I think
it's elegant and clean.

So if *they* want to use some other layout, where the Merkle blocks
are stored in some Alternate Data Stream, ala NTFS --- they are *free*
to do that.  It will require more work, and at that point, it will
require a layout change.  But it's Dave and Christoph who are
insisting on doing that; not me!

> And honestly, I still think that it's very odd to add the merge data
> to the end, when the filesystem already supports xattrs. It would have
> made much more sense to just make one xattr contain the merkle tree
> validation data.

The problem is that xattrs are designed to be accessed via a set/get
interface, are currently limited, IIRC at 32k.  The max size of an APK
is 300 megabytes; and the Merkle tree for a file that size will be
about 2.3 megabytes.  That's way too big to store as an xattr;
certainly using the existing xattr interfaces.  And it's also bigger
than most file systems can handle as xattrs today --- because they've
been optimzied for relatively small sizes, for things like SELinux
labels and ACL structures.

> So why is this sold as some unholy mess of "filesystem-specific" and
> "generic"? That part just annoys the hell out of me. Why isn't this
> sold as an *actual* generic model, where you just say "append the
> merkle tree to the file, then enable verity testing of the end result
> and validate the top-level hash".

That was the original way it was sold, but Cristoph and Dave have
NACK'ed it in that form.  The common fsverity code which is generic to
ext4 and f2fs does treat it that way, with the note that we "lie" to
userspace about is the size of the file and where the EOF is.  Dave
and Cristoph have declaimed strongly that this is this layout choice
is horrible, and filesystem specific, and XFS could never do it that
way.  I don't understand why, but they are the XFS experts.  So if
they want to do something else, what I've been trying to point out is
that they can do that, using the existing interface.

> So what's the excuse for doing the crazy odd "let's just support one
> single filesystem" model?

Android devices use both ext4 and f2fs; it's the manufacturer's
choice.  So we wanted fs-verity to support both.  And we didn't want
to duplicate code across ext4 and f2fs; hence trying to put common
code in fs/verity.  So we aren't supporting one file system out of the
gate; we're supporting two.

Whether XFS wants to implement fs-verity is purely XFS's choice.  XFS
has chosen not to support fscrypt, which is currently used by ext4,
f2fs, and ubifs, and both fscrypt's and fs-verity's initial use case
has been for Android, which is not an area where XFS has proven to be
a common choice.

So I was not really expecting that they would have any interest in
fs-verity.  But they seem to have very strong opinions about how they
would want to implement it, and it's different from what we have in
the current "generic code shared by ext4 and f2fs".  I was trying to
show that even if they wanted to do things in this different way ---
and I don't understand why it's so important to them --- it would be
possible to do so.

Cheers,

					- Ted
Linus Torvalds Dec. 22, 2018, 10:47 p.m. UTC | #29
On Fri, Dec 21, 2018 at 8:20 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> On Fri, Dec 21, 2018 at 11:13:07AM -0800, Linus Torvalds wrote:
> >
> > In other words: either the model is that the file *itself* contains
> > its own merkle tree that validates the file, or it isn't. You can't
> > have it two ways. No silly "layout changes when you apply the hash"
> > garbage. That's just crazy talk and invalidates the whole model.
>
> Userspace applications which are reading the file aren't going to be
> expecting Merkle tree.  For example, one of the use cases is Android
> APK files, which are essentially ZIP files.  ZIP files can be parsed
> both from the front-end (streaming), or by looking for the complete
> directory of all of the files in the ZIP file by starting at the end
> of the file and moving backwards.  If the Merkle tree was visible to
> userspace programs that are opening and reading the file, it would
> confuse them mightily.
>
> So what we do for ext4 and f2fs is make the Merkle tree invisible

Again, this has nothing that is per-filesystem in it.

If we were to decide to support the notion of "append merkle hashes to
the file for validation" at the vfs layer, the same logic would apply:
obviously the merkle data shouldn't be visible to user space.

But that's not a reason to do it at a filesystem layer, quite the
reverse: exactly like you say, as far as the *filesystem* is
concerned, the data is there in the file. It's literally about the
*view* of the file, ie the system call interface:

>    From the *file system's* perspective,
> though, the metadata blocks are part of the file.

To me that only argues that this all should be at the vfs layer, and
that it shouldn't be the filesystem that hides it. Exactly because as
far as the filesystem is concerned, the merkle data is there, it's
just that we hide it at read (and stat) time.

Preferably some way where it's namespace-dependent or whatever, so
that you could still access the original file data from user space if
you want to (eg some backup purpose or other).

What I'm missing is any kind of sane explanation for why it was done
so badly, and why it should be upstreamed despite the apparent bad
implementation.

It sounds like a complete hack.

Again, to me either the point is that it's a generic extension of the
file data, _or_ it's some filesystem-specific hidden data. The way
you've done it and written the documentation, it's clearly a generic
extension of normal file data, and I don't see what's fs-specific to
it.

> The problem is that xattrs are designed to be accessed via a set/get
> interface, are currently limited, IIRC at 32k.  The max size of an APK
> is 300 megabytes; and the Merkle tree for a file that size will be
> about 2.3 megabytes.  That's way too big to store as an xattr;
> certainly using the existing xattr interfaces.  And it's also bigger
> than most file systems can handle as xattrs today --- because they've
> been optimzied for relatively small sizes, for things like SELinux
> labels and ACL structures.

So *this* kind of argument is what I'm looking for.

That at least explains why it's not an xattr. Ugly, but understandable.

> > So why is this sold as some unholy mess of "filesystem-specific" and
> > "generic"? That part just annoys the hell out of me. Why isn't this
> > sold as an *actual* generic model, where you just say "append the
> > merkle tree to the file, then enable verity testing of the end result
> > and validate the top-level hash".
>
> That was the original way it was sold, but Cristoph and Dave have
> NACK'ed it in that form.

That seems entirely irrelevant. What do Christoph and Dave have to do
with it once it's generic? It would have _zero_ filesystem component
if it's actually done in a generic manner. It would be a total no-op
to XFS.

Which makes me think "it wasn't actually sold as being
filesystem-independent" at all.

So I want to understand why this was made a filesystem operation in
the first place. What's fs-specific about this implementation?

                 Linus
Matthew Wilcox Dec. 23, 2018, 4:10 a.m. UTC | #30
On Fri, Dec 21, 2018 at 11:17:12PM -0500, Theodore Y. Ts'o wrote:
> Userspace applications which are reading the file aren't going to be
> expecting Merkle tree.  For example, one of the use cases is Android
> APK files, which are essentially ZIP files.  ZIP files can be parsed
> both from the front-end (streaming), or by looking for the complete
> directory of all of the files in the ZIP file by starting at the end
> of the file and moving backwards.  If the Merkle tree was visible to
> userspace programs that are opening and reading the file, it would
> confuse them mightily.

Pretty much every file format has the ability to put arbitrary blocks
of information into a file somewhere the tools which don't know about
it will skip it.  For example, ZIP "includes an extra field facility
within file headers, which can be used to store extra data not defined
by existing ZIP specifications, and which allow compliant archivers that
do not recognize the fields to safely skip them. Header IDs 0–31 are
reserved for use by PKWARE. The remaining IDs can be used by third-party
vendors for proprietary usage. " (Wikipedia)

ELF, PNG, PDF and many other formats have the ability to put data
_somewhere_.  It might not be at the tail of the file, but there's
somewhere to do it.

(I appreciate this isn't what Linus is asking for, but I'm pointing out
that this is by no means as intractable as you make it sound.)
Theodore Ts'o Dec. 23, 2018, 4:34 a.m. UTC | #31
On Sat, Dec 22, 2018 at 02:47:22PM -0800, Linus Torvalds wrote:
> So I want to understand why this was made a filesystem operation in
> the first place. What's fs-specific about this implementation?

These are the things which are fs-specific.

*) We have to splice into the file system's readpage processing so we
   can verify the merkle tree hash before we mark the page up-to-date.
   This is most of the complexity involved in adding fs-verity
   support, and that's because both ext4 and f2fs have their own
   fs-specific readpage[s]() implementations, and both ext4 and f2fs
   also supports fscrypt, which *also* has to splice into readpage[s]().

*) The file system needs to define a file system feature bit in the
   superblock which means, "this file system uses fs-verity" --- so
   that old kernels will know that they need to refuse to mount the
   file system (f2fs) or mount the file system read-only (ext4).

*) The file system needs to define inode flag which is used to
   indicate "this is a fs-verity protected file".  This flag is not
   user-visible, so the file system just has to provide a single bit
   in the inode structure and a function which tests that bit.

*) Ext4 chose to have i_size on disk to be size of the data.  We did
   this so that the the fs-verity feature for ext4 could be a
   read-only compat feature.  (e.g., an old kernel can safely mount a
   file system with fs-verity protected files, but only for a
   read-only mount.)  This adds a bit more complexity for ext4 in that
   we need to look up in our extent tree to find the last block in the
   file (which is where the fs-verity superblock is located).

   For f2fs, it can just use the on-disk i_size to find the fs-verity
   superblock, and then from that, f2fs can find the original data
   i_size (which then gets presents to userspace when it calls
   stat(2).)

As far as the last point is concerned, ext4 could have done things the
f2fs way, which is simpler, and which would allowed us to make things
much more generic.  However, being able to support read-only mounts of
file systems with fs-verity protected files was important to me.

Everything else is generic and we tried to factor out as much common
code as possible into fs/verity.  But the model has always been that
at least *some* changes would be needed in the file system to call out
to the fs-verity code, primarily because we didn't want to make
changes to readpage()/readpages() VFS<->low-level fs interface.  That
would have required making changes in dozens of file systems, and
while that would have allowed us to factor out some duplicated code in
{ext4,f2fs}_readpage[s]() --- right now it's only those two file
systems out of 70 or so support fscrypt and fs-verity.  It's just not
worth it.

						- Ted
Theodore Ts'o Dec. 23, 2018, 4:45 a.m. UTC | #32
On Sat, Dec 22, 2018 at 08:10:07PM -0800, Matthew Wilcox wrote:
> Pretty much every file format has the ability to put arbitrary blocks
> of information into a file somewhere the tools which don't know about
> it will skip it.  For example, ZIP "includes an extra field facility
> within file headers, which can be used to store extra data not defined
> by existing ZIP specifications, and which allow compliant archivers that
> do not recognize the fields to safely skip them. Header IDs 0–31 are
> reserved for use by PKWARE. The remaining IDs can be used by third-party
> vendors for proprietary usage. " (Wikipedia)
> 
> ELF, PNG, PDF and many other formats have the ability to put data
> _somewhere_.  It might not be at the tail of the file, but there's
> somewhere to do it.
> 
> (I appreciate this isn't what Linus is asking for, but I'm pointing out
> that this is by no means as intractable as you make it sound.)

That design would require the fs-verity code to know the type of eacho
file, and where to find the in-band Merkle tree for each file type
that we wanted to support.  And if you wanted to use fs-verity to
protect a sudoers text configuration file (for example), we'd have to
teach sudo how to ignore the userspace visible Merkle tree.

So I agree with you that it's *possible*.  But it's ***ugly***.  *Way*
uglier than putting the Merkle tree at the end of the file data and
then making it invisible to userspace.

Cheers,

						- Ted
Daniel Colascione Jan. 4, 2019, 8:41 p.m. UTC | #33
On Sat, Dec 22, 2018 at 8:46 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> On Sat, Dec 22, 2018 at 08:10:07PM -0800, Matthew Wilcox wrote:
> > Pretty much every file format has the ability to put arbitrary blocks
> > of information into a file somewhere the tools which don't know about
> > it will skip it.  For example, ZIP "includes an extra field facility
> > within file headers, which can be used to store extra data not defined
> > by existing ZIP specifications, and which allow compliant archivers that
> > do not recognize the fields to safely skip them. Header IDs 0–31 are
> > reserved for use by PKWARE. The remaining IDs can be used by third-party
> > vendors for proprietary usage. " (Wikipedia)
> >
> > ELF, PNG, PDF and many other formats have the ability to put data
> > _somewhere_.  It might not be at the tail of the file, but there's
> > somewhere to do it.
> >
> > (I appreciate this isn't what Linus is asking for, but I'm pointing out
> > that this is by no means as intractable as you make it sound.)
>
> That design would require the fs-verity code to know the type of eacho
> file, and where to find the in-band Merkle tree for each file type
> that we wanted to support.  And if you wanted to use fs-verity to
> protect a sudoers text configuration file (for example), we'd have to
> teach sudo how to ignore the userspace visible Merkle tree.

I'm pretty late to the game, but I just want to bring up one approach
that I'm not sure people have previously considered. You can't put the
verification blob in an xattr due to xattr size limits, but you *can*
put a filename in an xattr. What if, at open time, fs-verity looked
for a specially-named xattr attached to a file, resolved that name
like a symlink target, opened the pointed-to file, and just used
*that* as the authentication blob? It'd also be possible to teach
unlink to delete the pointed-to file when the pointer file is deleted
--- sort of like a simple and stupid kind of data fork.

For example, if you wanted to secure /usr/bin/emacs, you could set an
security.fsverify.verification_file xattr (in the system namespace
because the xattr has special semantics) to
"/.verification-blobs/@usr@bin@emacs.hashtree" or something like that.
Then, open(2) on /usr/bin/emacs would, internally to VFS, also open
/.verification-blobs/@usr@bin@emacs.hashtree and read verification
data from it, transparently to both users and the underlying
filesystem. If someone deleted /usr/bin/emacs, VFS would automatically
delete /.verification-blobs/@usr@bin@emacs.hashtree. If
/.verification-blobs/@usr@bin@emacs.hashtree didn't exist at time of
open(2) of /usr/bin/emacs, or couldn't be opened for whatever reason,
the open(2) of /usr/bin/emacs would fail.

ISTM that a scheme like this would give you some of the advantages of
jumbo xattrs, but with much less implementation complexity. If
someone's proposed something like this before, sorry for the noise.
diff mbox series

Patch

diff --git a/Documentation/filesystems/fsverity.rst b/Documentation/filesystems/fsverity.rst
new file mode 100644
index 0000000000000..d633fc0567bd5
--- /dev/null
+++ b/Documentation/filesystems/fsverity.rst
@@ -0,0 +1,583 @@ 
+========================================================
+Read-only file-based authenticity protection (fs-verity)
+========================================================
+
+Introduction
+============
+
+fs-verity (``fs/verity/``) is a library that filesystems can hook into
+to support transparent integrity and authenticity protection of
+read-only files.  Currently, it is supported by the ext4 and f2fs
+filesystems.  Similar to fscrypt, not too much filesystem-specific
+code is needed to support fs-verity.
+
+fs-verity is similar to `dm-verity
+<https://www.kernel.org/doc/Documentation/device-mapper/verity.txt>`_
+but works on files rather than block devices.  On supported
+filesystems, userspace can append a Merkle tree (hash tree) to a file,
+then use an ioctl to enable fs-verity on it.  Then, the filesystem
+transparently verifies all data read from the file against the Merkle
+tree; reads that fail verification will fail.  The filesystem also
+hides or moves the Merkle tree, and forbids changes to the file's
+contents via the syscall interface.
+
+Essentially, fs-verity is a way of efficiently hashing a file, subject
+to the caveat that the enforcement of that hash happens on-demand as
+reads occur.  The file hash that fs-verity computes is called the
+"file measurement"; this is the hash of the Merkle tree's root hash
+and certain other fs-verity metadata, and it takes constant time to
+compute regardless of the size of the file.  Note: the value of the
+fs-verity file measurement will differ from a regular hash of the
+file, even when they use the same hash algorithm, e.g. SHA-256;
+however, they achieve the same purpose.
+
+Use cases
+=========
+
+In general, fs-verity does not replace or obsolete dm-verity.
+dm-verity should still be used when it is possible to authenticate the
+full block device, i.e. when the device is read-only.  fs-verity is
+intended for use on read-write filesystems where dm-verity cannot be
+used.
+
+fs-verity is most useful for hashing large files where only a small
+portion may be accessed.  For example, it's useful on Android
+application package (APK) files, which typically contain many
+translations, classes, and other resources that are infrequently or
+even never accessed on a particular device.  It would be wasteful to
+hash the entire file before starting the application.
+
+Unlike an ahead-of-time hash, fs-verity also re-verifies data each
+time it's paged in, which ensures the file measurement remains
+correctly enforced even if the file contents are modified from
+underneath the filesystem, e.g. by malicious disk firmware.
+
+fs-verity can support various use cases, such as:
+
+- Integrity protection (detecting accidental corruption)
+- Auditing (logging file hashes before use)
+- Authenticity protection (detecting malicious modification)
+
+Note that the latter two are not features of fs-verity per se, but
+rather fs-verity is a tool for supporting these use cases.  For
+example, for the overall system to actually provide authenticity
+protection, the file measurement itself must still be authenticated,
+e.g. by comparing it with a known good value or by verifying a digital
+signature of it.
+
+This can be userspace driven, in which case fs-verity will only be
+used (essentially) as a fast way of hashing the file contents, via the
+`FS_IOC_MEASURE_VERITY`_ ioctl.  For authenticity protection, trusted
+userspace code [#]_ still must verify the relevant portions of the
+untrusted filesystem state before it is used in a security-critical
+way, such as executing code from it.
+
+For example, the trusted userspace code might verify that the file
+located at ``/foo/bar/baz`` has an fs-verity file measurement of
+``sha256:a83d5cd722ef0d070b23353c2d9f316c38425114da8bd007cb9e8499371a97b3``,
+or that all security-critical files (e.g. executable code) have stored
+alongside them a valid digital signature (signed by a known, trusted
+public key) of their fs-verity file measurement, potentially combined
+with other important file metadata such as path and SELinux label.
+
+However, for ease of use, a subset of this policy logic (but not all
+of it!) is also supported in the kernel by the `Built-in signature
+verification`_ mechanism.  Support for fs-verity file hashes in IMA
+(Integrity Measurement Architecture) policies is also planned.
+
+.. [#] For example, on Android, "trusted userspace code" would be code
+       running from the system or vendor partitions, which are
+       read-only partitions authenticated by dm-verity tied into
+       Verified Boot, as opposed to the userdata partition which is
+       read-write.
+
+Metadata format
+===============
+
+Merkle tree
+-----------
+
+fs-verity uses the same Merkle tree (hash tree) format as dm-verity;
+the only difference is that fs-verity's Merkle tree is built over the
+contents of a regular file rather than a block device.
+
+Briefly, the file contents is divided into blocks, where the blocksize
+is configurable but usually 4096 bytes.  The last block is zero-padded
+if needed.  Each block is then hashed, producing the first level of
+hashes.  Then, the hashes in this first level are grouped into
+'blocksize'-byte blocks (zero-padding the ends as needed) and these
+blocks are hashed, producing the second level of hashes.  This
+proceeds up the tree until only a single block remains.  The hash of
+this block is called the "Merkle tree root hash".  Note: if the entire
+file contents fit in one block, then there are no hash blocks and the
+"Merkle tree root hash" is simply the hash of the data block.
+
+The blocks of the Merkle tree are stored on-disk starting from the
+root level and then proceeding to store each level down to the "first"
+(the level that gives the hashes of the data blocks).
+
+The hash algorithm is configurable.  The default is SHA-256, but
+SHA-512 is also supported.  The non-cryptographic checksum CRC-32C is
+also supported for integrity-only use cases such as detecting bit
+errors in read-only backup files.  A non-cryptographic checksum must
+not be used if authenticity protection is desired.
+
+In the recommended configuration of SHA-256 and 4K blocks, 128 hash
+values fit in each block.  Thus, each level of the hash tree is 128
+times smaller than the previous, and for large files the Merkle tree's
+size converges to approximately 1/129 of the original file size.
+However, for small files, the padding to a block boundary is
+significant, making the space overhead proportionally more.
+
+fs-verity descriptor
+--------------------
+
+For each file, fs-verity also uses an additional on-disk metadata
+structure called the *fs-verity descriptor*.  This contains the
+properties of the Merkle tree and some other information.  It begins
+with a header in the following format::
+
+    struct fsverity_descriptor {
+            __u8 magic[8];
+            __u8 major_version;
+            __u8 minor_version;
+            __u8 log_data_blocksize;
+            __u8 log_tree_blocksize;
+            __le16 data_algorithm;
+            __le16 tree_algorithm;
+            __le32 flags;
+            __le32 reserved1;
+            __le64 orig_file_size;
+            __le16 auth_ext_count;
+            __u8 reserved2[30];
+    };
+
+This structure contains:
+
+- ``magic`` is the ASCII bytes "FSVerity".
+- ``major_version`` is 1.
+- ``minor_version`` is 0.
+- ``log_data_blocksize`` and ``log_tree_blocksize`` are the log base 2
+  of the block size (in bytes) of data blocks and Merkle tree blocks,
+  respectively.  Currently, in both cases the kernel only supports
+  page-sized blocks, i.e. on most architectures, 4096-byte blocks.
+  Thus, usually both of these fields must be 12.
+- ``data_algorithm`` and ``tree_algorithm`` are the hash algorithms
+  used to hash data blocks and Merkle tree blocks, respectively.
+  Currently the kernel requires these to have the same value.  The
+  recommended value is FS_VERITY_ALG_SHA256.  See
+  ``include/uapi/linux/fsverity.h`` for the list of allowed values.
+- ``orig_file_size`` is the original size of the file in bytes.  This
+  means the size excluding the verity metadata and padding.
+- ``auth_ext_count`` is the number of authenticated extensions that
+  follow.
+- All other fields are zeroed.
+
+Following the ``struct fsverity_descriptor``, there is a list of
+"authenticated extensions".  Each extension is a variable-length
+structure that begins with the following header::
+
+    struct fsverity_extension {
+            __le32 length;
+            __le16 type;
+            __le16 reserved;
+    };
+
+This structure contains:
+
+- ``length`` is the length of this extension in bytes, including the
+  header.
+- ``type`` is the extension number.  See
+  ``include/uapi/linux/fsverity.h`` for the allowed values.
+- ``reserved`` must be 0.
+
+Each extension begins on an 8-byte aligned boundary.  When an
+extension's length is not a multiple of 8, it must be zero-padded to
+the next 8-byte boundary, even if it is the last extension.  This zero
+padding is not counted in the ``length`` field.
+
+This first list of extensions is "authenticated", meaning that they
+are included in the file measurement.  Currently, the following
+authenticated extensions are supported.  Except where otherwise
+indicated, extensions are optional and cannot be given multiple times:
+
+- FS_VERITY_EXT_ROOT_HASH:  This is mandatory.  It gives the root hash
+  of the Merkle tree, as a byte array.
+- FS_VERITY_EXT_SALT: A salt to salt the hashes with, given as a byte
+  array.  The salt is prepended to every block that is hashed.  Any
+  length salt is supported.  Note that using a unique salt for every
+  file should make it more difficult for fs-verity to be attacked
+  across many files.  However, in principle this is unnecessary since
+  simply choosing a strong cryptographic hash algorithm such as
+  SHA-256 or SHA-512 should be sufficient.
+
+Following the authenticated extensions, there is a list of
+unauthenticated extensions.  These are *not* included in the file
+measurement.  This list begins with a header::
+
+        __le16 unauth_ext_count;
+        __le16 padding[3];
+
+``unauth_ext_count`` is the number of unauthenticated extensions.
+This may be 0.
+
+Like authenticated extensions, each unauthenticated extension begins
+with the header ``struct fsverity_extension`` from above.
+
+The following types of unauthenticated extensions are supported:
+
+- FS_VERITY_EXT_PKCS7_SIGNATURE.  This is a DER-encoded PKCS#7 message
+  containing the signed file measurement.  See `Built-in signature
+  verification`_ for details.
+
+fsveritysetup format
+--------------------
+
+When enabling fs-verity on a file via the `FS_IOC_ENABLE_VERITY`_
+ioctl, the kernel requires that the verity metadata has been appended
+to the file contents.  Specifically, the file must be arranged as:
+
+#. Original file contents
+#. Zero-padding to next block boundary
+#. `Merkle tree`_
+#. `fs-verity descriptor`_
+#. fs-verity footer
+
+We call this file format the "fsveritysetup format".  It is not
+necessarily the on-disk format actually used by the filesystem, since
+the filesystem is free to move things around during the ioctl.
+However, the easiest way to implement fs-verity is to just keep this
+arrangement in-place, as ext4 and f2fs do; see `Filesystem support`_.
+
+Note that "block" here means the fs-verity block size, which is not
+necessarily the same as the filesystem's block size.  For example, on
+ext4, fs-verity can use 4K blocks on top of a filesystem formatted to
+use a 1K block size.
+
+The fs-verity footer is a structure of the following format::
+
+    struct fsverity_footer {
+            __le32 desc_reverse_offset;
+            __u8 magic[8];
+    };
+
+``desc_reverse_offset`` is the distance in bytes from the end of the
+fs-verity footer to the beginning of the fs-verity descriptor; this
+allows software to find the fs-verity descriptor.  ``magic`` is the
+ASCII bytes "FSVerity"; this allows software to quickly identify a
+file as being in the "fsveritysetup" format as well as find the
+fs-verity footer if zeroes have been appended.
+
+The kernel cannot handle fs-verity footers that cross a page boundary.
+Padding must be prepended as needed to meet this constaint.
+
+Filesystem support
+==================
+
+ext4
+----
+
+ext4 supports fs-verity since kernel version TODO.
+
+CONFIG_EXT4_FS_VERITY must be enabled in the kernel config.  Also, the
+filesystem must have been formatted with ``-O verity``, or had
+``tune2fs -O verity`` run on it.  These require e2fsprogs v1.44.4-2 or
+later.  This e2fsprogs version is also required for e2fsck to
+understand the verity feature.  Since "verity" is an RO_COMPAT
+feature, once enabled earlier kernels will be unable to mount the
+filesystem for writing, and earlier versions of e2fsck will be unable
+to check the filesystem.
+
+ext4 only allows fs-verity on extent-based files.
+
+The EXT4_VERITY_FL flag in the inode is used to indicate that the
+inode uses fs-verity.  This bit cannot be set directly; it can only be
+set indirectly via `FS_IOC_ENABLE_VERITY`_.
+
+When enabling verity on an inode, ext4 leaves the verity metadata
+in-place in the `fsveritysetup format`_.  However, it changes the
+on-disk i_size to the original file size, which allows the verity
+feature to be RO_COMPAT rather than INCOMPAT.  Later, the fs-verity
+footer is found by scanning backwards from the end of the last extent
+rather than from i_size.
+
+f2fs
+----
+
+f2fs supports fs-verity since kernel version TODO.
+
+CONFIG_F2FS_FS_VERITY must be enabled in the kernel config.  Also, the
+filesystem must have been formatted with ``-O verity``.  This requires
+f2fs-tools v1.11.0 or later.
+
+The FADVISE_VERITY_BIT flag in the inode is used to indicate that the
+inode uses fs-verity.  This bit cannot be set directly; it can only be
+set indirectly via `FS_IOC_ENABLE_VERITY`_.
+
+When enabling verity on an inode, f2fs leaves the verity metadata
+in-place in the `fsveritysetup format`_.  It leaves the on-disk i_size
+as the full file size; however, the in-memory i_size is overridden
+with the original size.
+
+User API
+========
+
+FS_IOC_ENABLE_VERITY
+--------------------
+
+The FS_IOC_ENABLE_VERITY ioctl enables fs-verity on a regular file.
+Userspace must have already appended verity metadata to the file,
+using the file format described in `fsveritysetup format`_.
+Additionally, the filesystem must support fs-verity.
+
+The argument parameter for this ioctl is reserved and must be NULL.
+
+This ioctl checks for write access to the inode; no capability is
+required.  However, it must be executed on an O_RDONLY file
+descriptor, and no processes may have the file open for writing.
+(This is necessary to prevent various race conditions.)
+
+On success, this ioctl returns 0, and the file becomes a verity file.
+This means that:
+
+- The filesystem marks the file as a verity file both in-memory and
+  on-disk, e.g. by setting a bit in the inode.
+- All later reads from the file are verified against the Merkle tree.
+- The verity metadata at the end of the file is hidden or moved.
+- Opening the file for writing or truncating it is no longer allowed.
+- There is no way to disable verity on the file, other than by
+  deleting it and replacing it with a copy.
+
+If this ioctl fails, then no changes are made to the file.  The
+reasons it might fail include:
+
+- ``EACCES``: the process does not have write access to the file
+- ``EBADMSG``: the file's fs-verity metadata is invalid
+- ``EEXIST``: the file already has fs-verity enabled
+- ``EINVAL``: a value was specified for the reserved argument
+  parameter, or the file descriptor refers to neither a regular file
+  nor a directory
+- ``EIO``: an I/O error occurred
+- ``EISDIR``: the file descriptor refers to a directory, not a regular
+  file
+- ``ENOTTY``: this type of filesystem does not implement fs-verity
+- ``EOPNOTSUPP``: the kernel was not configured with fs-verity support
+  for this filesystem, or the filesystem superblock has not had the
+  'verity' feature enabled on it.  (See `Filesystem support`_.)
+- ``EPERM``: the file is append-only
+- ``EROFS``: the filesystem is read-only
+- ``ETXTBSY``: the file is open for writing.  Note that this can be
+  the caller's file descriptor, or another open file descriptor, or
+  the file reference held by a writable memory map.
+
+FS_IOC_MEASURE_VERITY
+---------------------
+
+The FS_IOC_MEASURE_VERITY ioctl retrieves the fs-verity measurement of
+a regular file.  This is a digest that cryptographically summarizes
+the file contents that are being enforced on reads.  The file must
+have fs-verity enabled.
+
+This ioctl takes in a pointer to a variable-length structure::
+
+    struct fsverity_digest {
+            __u16 digest_algorithm;
+            __u16 digest_size; /* input/output */
+            __u8 digest[];
+    };
+
+``digest_size`` is an input/output field.  On input, it must be
+initialized to the number of bytes allocated for the variable-length
+``digest`` field.
+
+On success, 0 is returned and the kernel fills in the structure as
+follows:
+
+- ``digest_algorithm`` will be the hash algorithm used for the file
+  measurement.  It will match the algorithm used in the Merkle tree,
+  e.g. FS_VERITY_ALG_SHA256.  See ``include/uapi/linux/fsverity.h``
+  for the list of possible values.
+- ``digest_size`` will be the size of the digest in bytes, e.g. 32
+  for SHA-256.  (This can be redundant with ``digest_algorithm``.)
+- ``digest`` will be the actual bytes of the digest.
+
+This ioctl is guaranteed to be very fast.  Due to fs-verity's use of a
+Merkle tree, its running time is independent of the file size.
+
+This ioctl can fail with the following errors:
+
+- ``EFAULT``: invalid buffer was specified
+- ``ENODATA``: the file is not a verity file
+- ``ENOTTY``: this type of filesystem does not implement fs-verity
+- ``EOPNOTSUPP``: the kernel was not configured with fs-verity support
+  for this filesystem, or the filesystem superblock has not had the
+  'verity' feature enabled on it.  (See `Filesystem support`_.)
+- ``EOVERFLOW``: the file measurement is longer than the specified
+  ``digest_size`` bytes.  Try providing a larger buffer.
+
+Access semantics
+================
+
+fs-verity only implements reads, not writes.  Therefore, after it is
+enabled on a given file, regardless of the mode bits filesystems will
+forbid opening the file for writing as well as changing the size of
+the file via truncate().  The error code received for this is EPERM.
+
+However, fs-verity does not measure metadata such as owner, mode,
+timestamps, and xattrs.  Therefore, changes to these are still
+allowed.
+
+For read-only access, fs-verity is intended to be transparent; no
+changes to userspace applications should be needed.  However, astute
+users may notice some slight differences in behavior:
+
+- Direct I/O is not supported on verity files.  Attempts to use direct
+  I/O on such files will fall back to buffered I/O.
+
+- DAX (Direct Access) is not supported on verity files.
+
+Note: read-only mmaps are supported, as is combining fs-verity and
+fscrypt.
+
+Verity files can be sparse; holes are still verified.
+
+In-kernel policies
+==================
+
+Built-in signature verification
+-------------------------------
+
+With CONFIG_FS_VERITY_BUILTIN_SIGNATURES=y, fs-verity supports putting
+a portion of an authentication policy (see `Use cases`_) in the
+kernel.  Specifically, it adds support for:
+
+1. At fs-verity module initialization time, a keyring ".fs-verity" is
+   created.  The root user can add trusted X.509 certificates to this
+   keyring using the add_key() system call, then (when done)
+   optionally use keyctl_restrict_keyring() to prevent additional
+   certificates from being added.
+
+2. When a PKCS7_SIGNATURE extension containing a signed file
+   measurement is found in a file's verity metadata, the kernel will
+   verify this signature against the certificates in the ".fs-verity"
+   keyring, and verify that it matches the actual file measurement.
+   The extension must contain the PKCS#7 formatted signature in DER
+   format, where the signed data is the file measurement as a ``struct
+   fsverity_digest`` as described for `FS_IOC_MEASURE_VERITY`_ except
+   that all fields must be little-endian rather than native endian.
+
+3. A new sysctl "fs.verity.require_signatures" is made available.
+   When set to 1, the kernel requires that all fs-verity files have a
+   correctly signed file measurement as described in (2).
+
+This is meant as a relatively simple mechanism that can be used to
+provide some level of authenticity protection for fs-verity files, as
+an alternative to doing the signature verification in userspace or
+using IMA-appraisal.  However, with this mechanism, userspace programs
+still need to check that the fs-verity bit is set, and there is no
+protection against fs-verity files being swapped around.
+
+Implementation details
+======================
+
+I/O path design
+---------------
+
+To support fs-verity, the filesystem's ``->readpage()`` and
+``->readpages()`` methods are modified to verify the data pages before
+they are marked Uptodate.  Merely hooking ``->read_iter()`` would be
+insufficient, since ``->read_iter()`` is not used for memory maps.
+fs-verity exposes functions to verify data:
+
+- ``fsverity_verify_page()`` verifies an individual page
+- ``fsverity_verify_bio()`` verifies all pages in a bio
+
+Currently, fs-verity only supports the case where data blocks, hash
+blocks, and pages all have the same size (usually 4096 bytes).
+
+Filesystems that use bios call ``fsverity_verify_bio()`` after each
+read bio completes.  To do this while also continuing to support
+encryption (fscrypt), filesystems allocate a "post-read context" for
+each bio and store it in ``->bi_private``::
+
+    struct bio_post_read_ctx {
+           struct bio *bio;
+           struct work_struct work;
+           unsigned int cur_step;
+           unsigned int enabled_steps;
+    };
+
+``enabled_steps`` is a bitmask of the post-read steps that are
+enabled.  The available steps are STEP_DECRYPT and STEP_VERITY.  These
+steps can be enabled together, independently, or not at all.  If both
+are enabled, then decryption is done first.  Since bio completion
+callbacks cannot sleep, each post-read step is done by enqueueing the
+struct on a workqueue, and then actual work happens in the work item.
+Different workqueues are needed for encryption and verity because
+verity work may require decrypting metadata pages from the file.
+
+The bio completion callback sets PG_error for each page if either
+decryption or verification failed.  Finally, after the work item(s)
+complete, pages without PG_error are set Uptodate, and all pages are
+unlocked.
+
+A data page being set Uptodate and unlocked implies that it has been
+verified, and such pages become visible to userspace via read(),
+mmap(), etc.  Otherwise, the page is left in the PG_error && !Uptodate
+state which results in the read() family of syscalls failing with EIO,
+and accesses to the data via a memory map raising SIGBUS.  Note that
+even if some pages in a file fail verification, pages that pass
+verification can still be read.
+
+To verify a data page, fs-verity reads the required hash page(s)
+starting at the leaves and ascending to the root; then, the pages are
+verified descending from the root.  Filesystems that store the verity
+metadata past EOF implement reading hash pages using their usual
+``->readpage{,s}()`` methods, with modifications:
+
+- Verification is skipped for pages beyond ``i_size``.
+- When checking whether a page is in the implicit hole beyond EOF,
+  the full file size (including the verity metadata) is used rather
+  than the original data i_size.  Note that this does not allow
+  userspace to read or mmap the verity metadata.
+
+The hash pages are also cached in the inode's address_space, similar
+to data pages.  However, to simplify the verification logic, a hash
+page being Uptodate doesn't imply that it has been verified; instead,
+the PG_checked bit is used for this purpose.  Hash pages aren't locked
+while being verified, so multiple threads may race to set PG_checked,
+but this doesn't matter.
+
+Thus, when ascending the tree reading hash pages, fs-verity can stop
+as soon as it finds an already-checked hash page.  This optimization,
+which is also used by dm-verity, results in excellent sequential read
+performance since usually the deepest needed hash page will already be
+cached and checked.  However, random reads perform worse.
+
+Files may contain holes.  Normally, the filesystem's
+``->readpage{,s}()`` methods will zero pages in holes and set them
+Uptodate without issuing any bios.  To prevent this from being abused
+to bypass fs-verity, filesystems call ``fsverity_verify_page()`` on
+hole pages.
+
+Like fscrypt, filesystems also disable direct I/O on verity files,
+since direct I/O bypasses the normal read paths.
+
+Userspace utility
+=================
+
+This document focuses on the kernel, but a userspace utility for
+fs-verity can be found at:
+
+	https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/fsverity-utils.git
+
+See the README.md file in the fsverity-utils source tree for details,
+including examples of setting up fs-verity protected files.
+
+Tests
+=====
+
+To test fs-verity, use xfstests.  For example, using `kvm-xfstests
+<https://git.kernel.org/pub/scm/fs/ext2/xfstests-bld.git/tree/Documentation/kvm-quickstart.md>`_::
+
+    kvm-xfstests -c ext4,f2fs -g verity
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 46d1b1be3a510..818390c32be63 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -359,3 +359,14 @@  encryption of files and directories.
     :maxdepth: 2
 
     fscrypt
+
+Verity API
+==========
+
+A library which filesystems can hook into to support transparent
+authentication of read-only files.
+
+.. toctree::
+    :maxdepth: 2
+
+    fsverity