diff mbox series

[05/14] xfs: document the filesystem metadata checking strategy

Message ID 167243825231.682859.8572572208487315258.stgit@magnolia (mailing list archive)
State New, archived
Headers show
Series xfs: design documentation for online fsck | expand

Commit Message

Darrick J. Wong Dec. 30, 2022, 10:10 p.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

Begin the fifth chapter of the online fsck design documentation, where
we discuss the details of the data structures and algorithms used by the
kernel to examine filesystem metadata and cross-reference it around the
filesystem.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  579 ++++++++++++++++++++
 .../filesystems/xfs-self-describing-metadata.rst   |    1 
 2 files changed, 580 insertions(+)

Comments

Allison Henderson Jan. 21, 2023, 1:38 a.m. UTC | #1
On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Begin the fifth chapter of the online fsck design documentation, where
> we discuss the details of the data structures and algorithms used by the
> kernel to examine filesystem metadata and cross-reference it around the
> filesystem.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  .../filesystems/xfs-online-fsck-design.rst         |  579 ++++++++++++++++++++
>  .../filesystems/xfs-self-describing-metadata.rst   |    1 
>  2 files changed, 580 insertions(+)
> 
> 
> diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
> index 42e82971e036..f45bf97fa9c4 100644
> --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> @@ -864,3 +864,582 @@ Proposed patchsets include
>  and
>  `preservation of sickness info during memory reclaim
>  <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting>`_.
> +
> +5. Kernel Algorithms and Data Structures
> +========================================
> +
> +This section discusses the key algorithms and data structures of the kernel
> +code that provide the ability to check and repair metadata while the system
> +is running.
> +The first chapters in this section reveal the pieces that provide the
> +foundation for checking metadata.
> +The remainder of this section presents the mechanisms through which XFS
> +regenerates itself.
> +
> +Self Describing Metadata
> +------------------------
> +
> +Starting with XFS version 5 in 2012, XFS updated the format of nearly every
> +ondisk block header to record a magic number, a checksum, a universally
> +"unique" identifier (UUID), an owner code, the ondisk address of the block,
> +and a log sequence number.
> +When loading a block buffer from disk, the magic number, UUID, owner, and
> +ondisk address confirm that the retrieved block matches the specific owner of
> +the current filesystem, and that the information contained in the block is
> +supposed to be found at the ondisk address.
> +The first three components enable checking tools to disregard alleged metadata
> +that doesn't belong to the filesystem, and the fourth component enables the
> +filesystem to detect lost writes.
Add...

"When ever a file system operation modifies a block, the change is
submitted to the journal as a transaction.  The journal then processes
these transactions marking them done once they are safely committed to
the disk"

At this point we haven't talked much at all about transactions or logs,
and we've just barely begun to cover blocks.  I think you at least want
a quick blip to describe the relation of these two things, or it may
not be clear why we suddenly jumped into logs.
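
To make the six header fields concrete, here is a simplified sketch of a
v5 block header.  This is an illustration only -- the real ondisk
structures (struct xfs_btree_block, for instance) arrange these fields
differently for each block type:

    /* Hypothetical composite of the v5 self-describing header fields. */
    struct v5_header_sketch {
            __be32  magic;  /* structure type */
            __be32  crc;    /* checksum of the block contents */
            uuid_t  uuid;   /* ties the block to one filesystem */
            __be64  owner;  /* AG or inode to which this block belongs */
            __be64  blkno;  /* ondisk address, to detect misplaced blocks */
            __be64  lsn;    /* log sequence number of the last update */
    };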

> +
> +The logging code maintains the checksum and the log sequence number of the last
> +transactional update.
> +Checksums are useful for detecting torn writes and other mischief between the
"Checksums (or crc's) are useful for detecting incomplete or torn
writes as well as other discrepancies..."


> +computer and its storage devices.
> +Sequence number tracking enables log recovery to avoid applying out of date
> +log updates to the filesystem.
> +
> +These two features improve overall runtime resiliency by providing a means for
> +the filesystem to detect obvious corruption when reading metadata blocks from
> +disk, but these buffer verifiers cannot provide any consistency checking
> +between metadata structures.
> +
> +For more information, please see the documentation for
> +Documentation/filesystems/xfs-self-describing-metadata.rst
> +
> +Reverse Mapping
> +---------------
> +
> +The original design of XFS (circa 1993) is an improvement upon 1980s Unix
> +filesystem design.
> +In those days, storage density was expensive, CPU time was scarce, and
> +excessive seek time could kill performance.
> +For performance reasons, filesystem authors were reluctant to add redundancy to
> +the filesystem, even at the cost of data integrity.
> +Filesystem designers in the early 21st century chose different strategies to
> +increase internal redundancy -- either storing nearly identical copies of
> +metadata, or more space-efficient techniques such as erasure coding.
"such as erasure coding which may encode sections of the data with
redundant symbols and in more than one location"

That ties it into the next line.  If you go on to talk about a term you
have not previously defined, i think you want to either define it
quickly or just drop it all together.  Right now your goal is to just
give the reader context, so you want it to move quickly.

> +Obvious corruptions are typically repaired by copying replicas or
> +reconstructing from codes.
> +
I think I would have just jumped straight from xfs history to modern
xfs...
> +For XFS, a different redundancy strategy was chosen to modernize the design:
> +a secondary space usage index that maps allocated disk extents back to their
> +owners.
> +By adding a new index, the filesystem retains most of its ability to scale
> +well to heavily threaded workloads involving large datasets, since the primary
> +file metadata (the directory tree, the file block map, and the allocation
> +groups) remain unchanged.
> 

> +Although the reverse-mapping feature increases overhead costs for space
> +mapping activities just like any other system that improves redundancy, it
"Like any system that improves redundancy, the reverse-mapping feature
increases overhead costs for space mapping activities. However, it..."

> +has two critical advantages: first, the reverse index is key to enabling online
> +fsck and other requested functionality such as filesystem reorganization,
> +better media failure reporting, and shrinking.
> +Second, the different ondisk storage format of the reverse mapping btree
> +defeats device-level deduplication, because the filesystem requires real
> +redundancy.
> +
> +A criticism of adding the secondary index is that it does nothing to improve
> +the robustness of user data storage itself.
> +This is a valid point, but adding a new index for file data block checksums
> +increases write amplification and turns data overwrites into copy-writes, which
> +age the filesystem prematurely.
> +In keeping with thirty years of precedent, users who want file data integrity
> +can supply as powerful a solution as they require.
> +As for metadata, the complexity of adding a new secondary index of space usage
> +is much less than adding volume management and storage device mirroring to XFS
> +itself.
> +Perfection of RAID and volume management is best left to existing layers in
> +the kernel.
I think I would cull the entire above paragraph.  rmap, crc and raid
all have very different points of redundancy, so criticism that an
apple is not an orange or vice versa just feels like a shortsighted
comparison that's probably more of a distraction than anything.

Sometimes it feels like this document kinda gets off into tangents
like it's preemptively trying to position itself for an argument
that hasn't happened yet.  But I think it has the effect of pulling the
reader's attention off topic into an argument they never thought to
consider in the first place.  The topic of this section is to explain
what rmap is.  So let's stay on topic and finish laying out that ground
work first before getting into how it compares to other solutions.

> +
> +The information captured in a reverse space mapping record is as follows:
> +
> +.. code-block:: c
> +
> +       struct xfs_rmap_irec {
> +           xfs_agblock_t    rm_startblock;   /* extent start block */
> +           xfs_extlen_t     rm_blockcount;   /* extent length */
> +           uint64_t         rm_owner;        /* extent owner */
> +           uint64_t         rm_offset;       /* offset within the owner */
> +           unsigned int     rm_flags;        /* state flags */
> +       };
> +
> +The first two fields capture the location and size of the physical space,
> +in units of filesystem blocks.
> +The owner field tells scrub which metadata structure or file inode has been
> +assigned this space.
> +For space allocated to files, the offset field tells scrub where the space was
> +mapped within the file fork.
> +Finally, the flags field provides extra information about the space usage --
> +is this an attribute fork extent?  A file mapping btree extent?  Or an
> +unwritten data extent?
> +
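As an aside, the flags field can be made concrete with a short sketch.
The flag names below follow the kernel's rmap flag definitions, but the
classification helper itself is invented for illustration:

    /* Hypothetical helper that names the space usage of an rmap record. */
    static const char *
    rmap_usage(const struct xfs_rmap_irec *irec)
    {
            if (irec->rm_flags & XFS_RMAP_ATTR_FORK)
                    return "attribute fork extent";
            if (irec->rm_flags & XFS_RMAP_BMBT_BLOCK)
                    return "file mapping btree extent";
            if (irec->rm_flags & XFS_RMAP_UNWRITTEN)
                    return "unwritten data extent";
            return "written data extent";
    }
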
> +Online filesystem checking judges the consistency of each primary metadata
> +record by comparing its information against all other space indices.
> +The reverse mapping index plays a key role in the consistency checking process
> +because it contains a centralized alternate copy of all space allocation
> +information.
> +Program runtime and ease of resource acquisition are the only real limits to
> +what online checking can consult.
> +For example, a file data extent mapping can be checked against:
> +
> +* The absence of an entry in the free space information.
> +* The absence of an entry in the inode index.
> +* The absence of an entry in the reference count data if the file is not
> +  marked as having shared extents.
> +* The correspondence of an entry in the reverse mapping information.
> +
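In pseudo-C, that cross-referencing might look like the sketch below.
Every helper named here is a hypothetical stand-in, although the
kernel's scrub code provides cross-referencing helpers in a similar
spirit:

    /* Sketch: cross-reference one file data extent against the other
     * space indices. */
    static void
    xref_file_data_extent(struct scrub_ctx *sc, xfs_agblock_t agbno,
                          xfs_extlen_t len, bool has_shared_extents)
    {
            xref_not_free_space(sc, agbno, len);      /* free space btrees */
            xref_not_inode_chunk(sc, agbno, len);     /* inode btrees */
            if (!has_shared_extents)
                    xref_not_shared(sc, agbno, len);  /* refcount btree */
            xref_rmap_has_owner(sc, agbno, len);      /* rmap btree */
    }
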
> +A key observation here is that only the reverse mapping can provide a positive
> +affirmation of correctness if the primary metadata is in doubt.
if any of the above metadata is in doubt...

> +The checking code for most primary metadata follows a path similar to the
> +one outlined above.
> +
> +A second observation to make about this secondary index is that proving its
> +consistency with the primary metadata is difficult.

> +Demonstrating that a given reverse mapping record exactly corresponds to the
> +primary space metadata involves a full scan of all primary space metadata,
> +which is very time intensive.
"But why?" Wonders the reader. Just jump into an example:

"In order to verify that an rmap extent does not incorrectly over lap
with another record, we would need a full scan of all the other
records, which is time intensive."

?

And then the below is a separate observation, right?

> +Scanning activity for online fsck can only use non-blocking lock acquisition
> +primitives if the locking order is not the regular order as used by the rest of
> +the filesystem.
Lastly, it should be noted that most file system operations tend to
lock primary metadata before locking the secondary metadata.  This
means that scanning operations that acquire the secondary metadata
first may need to yield the secondary lock to filesystem operations
that have already acquired the primary lock. 

?

> +This means that forward progress during this part of a scan of the reverse
> +mapping data cannot be guaranteed if system load is especially heavy.
> +Therefore, it is not practical for online check to detect reverse mapping
> +records that lack a counterpart in the primary metadata.
Such as <quick list / quick example>

> +Instead, scrub relies on rigorous cross-referencing during the primary space
> +mapping structure checks.
> +

The below paragraph sounds like a re-cap?

"So to recap, reverse mappings also...."
> +Reverse mappings also play a key role in reconstruction of primary metadata.
> +The secondary information is general enough for online repair to synthesize a
> +complete copy of any primary space management metadata by locking that
> +resource, querying all reverse mapping indices looking for records matching
> +the relevant resource, and transforming the mapping into an appropriate format.
> +The details of how these records are staged, written to disk, and committed
> +into the filesystem are covered in subsequent sections.
I also think the section would be ok if you were to trim off this last
paragraph too.

> +
> +Checking and Cross-Referencing
> +------------------------------
> +
> +The first step of checking a metadata structure is to examine every record
> +contained within the structure and its relationship with the rest of the
> +system.
> +XFS contains multiple layers of checking to try to prevent inconsistent
> +metadata from wreaking havoc on the system.
> +Each of these layers contributes information that helps the kernel to make
> +five decisions about the health of a metadata structure:
> +
> +- Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``) ?
> +- Is this structure inconsistent with the rest of the system
> +  (``XFS_SCRUB_OFLAG_XCORRUPT``) ?
> +- Is there so much damage around the filesystem that cross-referencing is not
> +  possible (``XFS_SCRUB_OFLAG_XFAIL``) ?
> +- Can the structure be optimized to improve performance or reduce the size of
> +  metadata (``XFS_SCRUB_OFLAG_PREEN``) ?
> +- Does the structure contain data that is not inconsistent but deserves review
> +  by the system administrator (``XFS_SCRUB_OFLAG_WARNING``) ?
> +
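These OFLAG bits are returned to userspace in the sm_flags field of
struct xfs_scrub_metadata after an XFS_IOC_SCRUB_METADATA ioctl.  A
sketch of how a caller might interpret them, assuming the UAPI flag
definitions from xfs_fs.h:

    /* Sketch: translate scrub result flags into human-readable verdicts. */
    static void
    report_scrub_result(__u32 sm_flags)
    {
            if (sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
                    puts("structure is obviously corrupt");
            if (sm_flags & XFS_SCRUB_OFLAG_XCORRUPT)
                    puts("structure disagrees with other metadata");
            if (sm_flags & XFS_SCRUB_OFLAG_XFAIL)
                    puts("cross-referencing could not be completed");
            if (sm_flags & XFS_SCRUB_OFLAG_PREEN)
                    puts("structure is correct but could be optimized");
            if (sm_flags & XFS_SCRUB_OFLAG_WARNING)
                    puts("contents deserve administrator review");
    }
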
> +The following sections describe how the metadata scrubbing process works.
> +
> +Metadata Buffer Verification
> +````````````````````````````
> +
> +The lowest layer of metadata protection in XFS is the metadata verifiers built
> +into the buffer cache.
> +These functions perform inexpensive internal consistency checking of the block
> +itself, and answer these questions:
> +
> +- Does the block belong to this filesystem?
> +
> +- Does the block belong to the structure that asked for the read?
> +  This assumes that metadata blocks only have one owner, which is always true
> +  in XFS.
> +
> +- Is the type of data stored in the block within a reasonable range of what
> +  scrub is expecting?
> +
> +- Does the physical location of the block match the location it was read from?
> +
> +- Does the block checksum match the data?
> +
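For readers unfamiliar with the buffer cache, the shape of a verifier is
sketched below.  It is modeled on the kernel's struct xfs_buf_ops, with
the function bodies reduced to commentary:

    /* Sketch of a buffer verifier pair; details are simplified. */
    static void
    example_verify_read(struct xfs_buf *bp)
    {
            /* check the checksum, then the magic, uuid, owner, and the
             * ondisk address recorded in the header; mark the buffer
             * corrupt on any mismatch so the read is not trusted */
    }

    static void
    example_verify_write(struct xfs_buf *bp)
    {
            /* re-run the structural checks, then stamp a fresh checksum
             * just before the buffer goes to disk */
    }

    const struct xfs_buf_ops example_buf_ops = {
            .name         = "example",
            .verify_read  = example_verify_read,
            .verify_write = example_verify_write,
    };
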
> +The scope of the protections here is very limited -- verifiers can only
> +establish that the filesystem code is reasonably free of gross corruption bugs
> +and that the storage system is reasonably competent at retrieval.
> +Corruption problems observed at runtime cause the generation of health reports,
> +failed system calls, and in the extreme case, filesystem shutdowns if the
> +corrupt metadata forces the cancellation of a dirty transaction.
> +
> +Every online fsck scrubbing function is expected to read every ondisk metadata
> +block of a structure in the course of checking the structure.
> +Corruption problems observed during a check are immediately reported to
> +userspace as corruption; during a cross-reference, they are reported as a
> +failure to cross-reference once the full examination is complete.
> +Reads satisfied by a buffer already in cache (and hence already verified)
> +bypass these checks.
> +
> +Internal Consistency Checks
> +```````````````````````````
> +
> +The next higher level of metadata protection is the internal record
"After the buffer cache, the next level of metadata protection is..."
> +verification code built into the filesystem.

> +These checks are split between the buffer verifiers, the in-filesystem users of
> +the buffer cache, and the scrub code itself, depending on the amount of higher
> +level context required.
> +The scope of checking is still internal to the block.
> +For performance reasons, regular code may skip some of these checks unless
> +debugging is enabled or a write is about to occur.
> +Scrub functions, of course, must check all possible problems.
I'd put this chunk after the list below.

> +Either way, these higher level checking functions answer these questions:
Then this becomes:
"These higher level checking functions..."

> +
> +- Does the type of data stored in the block match what scrub is expecting?
> +
> +- Does the block belong to the owning structure that asked for the read?
> +
> +- If the block contains records, do the records fit within the block?
> +
> +- If the block tracks internal free space information, is it consistent with
> +  the record areas?
> +
> +- Are the records contained inside the block free of obvious corruptions?
> +
> +Record checks in this category are more rigorous and more time-intensive.
> +For example, block pointers and inumbers are checked to ensure that they point
> +within the dynamically allocated parts of an allocation group and within
> +the filesystem.
> +Names are checked for invalid characters, and flags are checked for invalid
> +combinations.
> +Other record attributes are checked for sensible values.
> +Btree records spanning an interval of the btree keyspace are checked for
> +correct order and lack of mergeability (except for file fork mappings).
> +
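The pointer range checks might be sketched as follows.  The shape
mirrors the kernel's verification helpers (xfs_verify_agbno, for
example), but this particular function is invented for illustration:

    /* Hypothetical check that a per-AG block pointer points into the
     * dynamically allocated part of an allocation group. */
    static bool
    example_agbno_is_valid(struct xfs_mount *mp, xfs_agblock_t agbno)
    {
            /* must point past the fixed AG header blocks... */
            if (agbno <= XFS_AGFL_BLOCK(mp))
                    return false;
            /* ...and stay inside the AG; the real helpers also handle
             * the short final AG */
            if (agbno >= mp->m_sb.sb_agblocks)
                    return false;
            return true;
    }
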
> +Validation of Userspace-Controlled Record Attributes
> +````````````````````````````````````````````````````
> +
> +Various pieces of filesystem metadata are directly controlled by userspace.
> +Because of this nature, validation work cannot be more precise than checking
> +that a value is within the possible range.
> +These fields include:
> +
> +- Superblock fields controlled by mount options
> +- Filesystem labels
> +- File timestamps
> +- File permissions
> +- File size
> +- File flags
> +- Names present in directory entries, extended attribute keys, and filesystem
> +  labels
> +- Extended attribute key namespaces
> +- Extended attribute values
> +- File data block contents
> +- Quota limits
> +- Quota timer expiration (if resource usage exceeds the soft limit)
> +
> +Cross-Referencing Space Metadata
> +````````````````````````````````
> +
> +The next higher level of checking is cross-referencing records between metadata

I kinda like the list first so that the reader has an idea of what
these checks are before getting into discussion about them.  It just
makes it a little more obvious as to why it's "prohibitively expensive"
or "dependent on the context of the structure" after having just looked
at it.

The rest looks good from here.

Allison

> +structures.
> +For regular runtime code, the cost of these checks is considered to be
> +prohibitively expensive, but as scrub is dedicated to rooting out
> +inconsistencies, it must pursue all avenues of inquiry.
> +The exact set of cross-referencing is highly dependent on the context of the
> +data structure being checked.
> +
> +The XFS btree code has keyspace scanning functions that online fsck uses to
> +cross reference one structure with another.
> +Specifically, scrub can scan the key space of an index to determine if that
> +keyspace is fully, sparsely, or not at all mapped to records.
> +For the reverse mapping btree, it is possible to mask parts of the key for the
> +purposes of performing a keyspace scan so that scrub can decide if the rmap
> +btree contains records mapping a certain extent of physical space without the
> +sparseness of the rest of the rmap keyspace getting in the way.
> +
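The three possible answers of such a keyspace scan might be expressed
like this (an illustrative sketch, not the kernel's actual types):

    /* Possible results of scanning one keyspace of an index. */
    enum keyspace_fill {
            KEYSPACE_EMPTY,  /* no records map to this keyspace */
            KEYSPACE_SPARSE, /* part of the keyspace maps to records */
            KEYSPACE_FULL,   /* the entire keyspace maps to records */
    };
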
> +Btree blocks undergo the following checks before cross-referencing:
> +
> +- Does the type of data stored in the block match what scrub is expecting?
> +
> +- Does the block belong to the owning structure that asked for the read?
> +
> +- Do the records fit within the block?
> +
> +- Are the records contained inside the block free of obvious corruptions?
> +
> +- Are the name hashes in the correct order?
> +
> +- Do node pointers within the btree point to valid block addresses for the type
> +  of btree?
> +
> +- Do child pointers point towards the leaves?
> +
> +- Do sibling pointers point across the same level?
> +
> +- For each node block record, does the record key accurately reflect the
> +  contents of the child block?
> +
> +Space allocation records are cross-referenced as follows:
> +
> +1. Any space mentioned by any metadata structure is cross-referenced as
> +   follows:
> +
> +   - Does the reverse mapping index list only the appropriate owner as the
> +     owner of each block?
> +
> +   - Are none of the blocks claimed as free space?
> +
> +   - If these aren't file data blocks, are none of the blocks claimed as space
> +     shared by different owners?
> +
> +2. Btree blocks are cross-referenced as follows:
> +
> +   - Everything in class 1 above.
> +
> +   - If there's a parent node block, do the keys listed for this block match the
> +     keyspace of this block?
> +
> +   - Do the sibling pointers point to valid blocks?  Of the same level?
> +
> +   - Do the child pointers point to valid blocks?  Of the next level down?
> +
> +3. Free space btree records are cross-referenced as follows:
> +
> +   - Everything in class 1 and 2 above.
> +
> +   - Does the reverse mapping index list no owners of this space?
> +
> +   - Is this space not claimed by the inode index for inodes?
> +
> +   - Is it not mentioned by the reference count index?
> +
> +   - Is there a matching record in the other free space btree?
> +
> +4. Inode btree records are cross-referenced as follows:
> +
> +   - Everything in class 1 and 2 above.
> +
> +   - Is there a matching record in the free inode btree?
> +
> +   - Do cleared bits in the holemask correspond with inode clusters?
> +
> +   - Do set bits in the freemask correspond with inode records with zero link
> +     count?
> +
> +5. Inode records are cross-referenced as follows:
> +
> +   - Everything in class 1.
> +
> +   - Do all the fields that summarize information about the file forks actually
> +     match those forks?
> +
> +   - Does each inode with zero link count correspond to a record in the free
> +     inode btree?
> +
> +6. File fork space mapping records are cross-referenced as follows:
> +
> +   - Everything in class 1 and 2 above.
> +
> +   - Is this space not mentioned by the inode btrees?
> +
> +   - If this is a CoW fork mapping, does it correspond to a CoW entry in the
> +     reference count btree?
> +
> +7. Reference count records are cross-referenced as follows:
> +
> +   - Everything in class 1 and 2 above.
> +
> +   - Within the space subkeyspace of the rmap btree (that is to say, all
> +     records mapped to a particular space extent and ignoring the owner info),
> +     are there the same number of reverse mapping records for each block as the
> +     reference count record claims?
> +
> +Proposed patchsets are the series to find gaps in
> +`refcount btree
> +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-gaps>`_,
> +`inode btree
> +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gaps>`_, and
> +`rmap btree
> +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-gaps>`_ records;
> +to find
> +`mergeable records
> +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-records>`_;
> +and to
> +`improve cross referencing with rmap
> +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-checking>`_
> +before starting a repair.
> +
> +Checking Extended Attributes
> +````````````````````````````
> +
> +Extended attributes implement a key-value store that enables fragments of data
> +to be attached to any file.
> +Both the kernel and userspace can access the keys and values, subject to
> +namespace and privilege restrictions.
> +Most typically these fragments are metadata about the file -- origins, security
> +contexts, user-supplied labels, indexing information, etc.
> +
> +Names can be as long as 255 bytes and can exist in several different
> +namespaces.
> +Values can be as large as 64KB.
> +A file's extended attributes are stored in blocks mapped by the attr fork.
> +The mappings point to leaf blocks, remote value blocks, or dabtree blocks.
> +Block 0 in the attribute fork is always the top of the structure, but otherwise
> +each of the three types of blocks can be found at any offset in the attr fork.
> +Leaf blocks contain attribute key records that point to the name and the value.
> +Names are always stored elsewhere in the same leaf block.
> +Values that are less than 3/4 the size of a filesystem block are also stored
> +elsewhere in the same leaf block.
> +Remote value blocks contain values that are too large to fit inside a leaf.
> +If the leaf information exceeds a single filesystem block, a dabtree (also
> +rooted at block 0) is created to map hashes of the attribute names to leaf
> +blocks in the attr fork.
> +
> +Checking an extended attribute structure is not so straightforward due to the
> +lack of separation between attr blocks and index blocks.
> +Scrub must read each block mapped by the attr fork and ignore the non-leaf
> +blocks:
> +
> +1. Walk the dabtree in the attr fork (if present) to ensure that there are no
> +   irregularities in the blocks or dabtree mappings that do not point to
> +   attr leaf blocks.
> +
> +2. Walk the blocks of the attr fork looking for leaf blocks.
> +   For each entry inside a leaf:
> +
> +   a. Validate that the name does not contain invalid characters.
> +
> +   b. Read the attr value.
> +      This performs a named lookup of the attr name to ensure the correctness
> +      of the dabtree.
> +      If the value is stored in a remote block, this also validates the
> +      integrity of the remote value block.
> +
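In pseudo-C, the attr scrub walk might be sketched as follows; every
helper named here is a hypothetical stand-in for scrub internals:

    /* Sketch of the two-pass extended attribute check. */
    static int
    check_attr_fork(struct xfs_inode *ip)
    {
            int error;

            /* Pass 1: if a dabtree exists, walk it and reject blocks or
             * mappings that do not point to attr leaf blocks. */
            error = walk_attr_dabtree(ip);
            if (error)
                    return error;

            /* Pass 2: walk every attr fork block, skip the non-leaf
             * blocks, and validate each entry; the by-name lookup also
             * exercises remote value blocks. */
            return walk_attr_leaves(ip, check_attr_entry);
    }
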
> +Checking and Cross-Referencing Directories
> +``````````````````````````````````````````
> +
> +The filesystem directory tree is a directed acyclic graph structure, with files
> +constituting the nodes, and directory entries (dirents) constituting the edges.
> +Directories are a special type of file containing a set of mappings from a
> +255-byte sequence (name) to an inumber.
> +These are called directory entries, or dirents for short.
> +Each directory file must have exactly one directory pointing to the file.
> +A root directory points to itself.
> +Directory entries point to files of any type.
> +Each non-directory file may have multiple directories point to it.
> +
> +In XFS, directories are implemented as a file containing up to three 32GB
> +partitions.
> +The first partition contains directory entry data blocks.
> +Each data block contains variable-sized records associating a user-provided
> +name with an inumber and, optionally, a file type.
> +If the directory entry data grows beyond one block, the second partition (which
> +exists as post-EOF extents) is populated with a block containing free space
> +information and an index that maps hashes of the dirent names to directory data
> +blocks in the first partition.
> +This makes directory name lookups very fast.
> +If this second partition grows beyond one block, the third partition is
> +populated with a linear array of free space information for faster
> +expansions.
> +If the free space has been separated and the second partition grows again
> +beyond one block, then a dabtree is used to map hashes of dirent names to
> +directory data blocks.
> +
> +Checking a directory is pretty straightforward:
> +
> +1. Walk the dabtree in the second partition (if present) to ensure that there
> +   are no irregularities in the blocks or dabtree mappings that do not point to
> +   dirent blocks.
> +
> +2. Walk the blocks of the first partition looking for directory entries.
> +   Each dirent is checked as follows:
> +
> +   a. Does the name contain no invalid characters?
> +
> +   b. Does the inumber correspond to an actual, allocated inode?
> +
> +   c. Does the child inode have a nonzero link count?
> +
> +   d. If a file type is included in the dirent, does it match the type of the
> +      inode?
> +
> +   e. If the child is a subdirectory, does the child's dotdot pointer point
> +      back to the parent?
> +
> +   f. If the directory has a second partition, perform a named lookup of the
> +      dirent name to ensure the correctness of the dabtree.
> +
> +3. Walk the free space list in the third partition (if present) to ensure that
> +   the free spaces it describes are really unused.
> +
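The per-dirent checks of step 2 might be sketched like this; the helper
names are invented, and only the check order mirrors the list above:

    /* Sketch of checking one directory entry (steps a-f). */
    static int
    check_one_dirent(struct xfs_inode *dp, const struct dirent_view *de)
    {
            struct xfs_inode *ip;

            if (!name_is_valid(de->name, de->namelen))          /* a */
                    return -EFSCORRUPTED;
            ip = grab_allocated_inode(de->ino);                 /* b */
            if (!ip)
                    return -EFSCORRUPTED;
            if (inode_link_count(ip) == 0)                      /* c */
                    return -EFSCORRUPTED;
            if (de->ftype && de->ftype != inode_ftype(ip))      /* d */
                    return -EFSCORRUPTED;
            if (is_dir(ip) && dotdot_ino(ip) != dp->i_ino)      /* e */
                    return -EFSCORRUPTED;
            if (dir_has_hash_index(dp))                         /* f */
                    return check_hash_lookup(dp, de);
            return 0;
    }
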
> +Checking operations involving :ref:`parents <dirparent>` and
> +:ref:`file link counts <nlinks>` are discussed in more detail in later
> +sections.
> +
> +Checking Directory/Attribute Btrees
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +As stated in previous sections, the directory/attribute btree (dabtree) index
> +maps user-provided names to improve lookup times by avoiding linear scans.
> +Internally, it maps a 32-bit hash of the name to a block offset within the
> +appropriate file fork.
> +
> +The internal structure of a dabtree closely resembles the btrees that record
> +fixed-size metadata records -- each dabtree block contains a magic number, a
> +checksum, sibling pointers, a UUID, a tree level, and a log sequence number.
> +The format of leaf and node records is the same -- each entry points to the
> +next level down in the hierarchy, with dabtree node records pointing to dabtree
> +leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere
> +in the fork.
> +
> +Checking and cross-referencing the dabtree is very similar to what is done for
> +space btrees:
> +
> +- Does the type of data stored in the block match what scrub is expecting?
> +
> +- Does the block belong to the owning structure that asked for the read?
> +
> +- Do the records fit within the block?
> +
> +- Are the records contained inside the block free of obvious corruptions?
> +
> +- Are the name hashes in the correct order?
> +
> +- Do node pointers within the dabtree point to valid fork offsets for dabtree
> +  blocks?
> +
> +- Do leaf pointers within the dabtree point to valid fork offsets for directory
> +  or attr leaf blocks?
> +
> +- Do child pointers point towards the leaves?
> +
> +- Do sibling pointers point across the same level?
> +
> +- For each dabtree node record, does the record key accurately reflect the
> +  contents of the child dabtree block?
> +
> +- For each dabtree leaf record, does the record key accurately reflect the
> +  contents of the directory or attr block?
> +
> +Cross-Referencing Summary Counters
> +``````````````````````````````````
> +
> +XFS maintains three classes of summary counters: available resources, quota
> +resource usage, and file link counts.
> +
> +In theory, the amount of available resources (data blocks, inodes, realtime
> +extents) can be found by walking the entire filesystem.
> +This would make for very slow reporting, so a transactional filesystem can
> +maintain summaries of this information in the superblock.
> +Cross-referencing these values against the filesystem metadata should be a
> +simple matter of walking the free space and inode metadata in each AG and the
> +realtime bitmap, but there are complications that will be discussed in
> +:ref:`more detail <fscounters>` later.
> +
> +:ref:`Quota usage <quotacheck>` and :ref:`file link count <nlinks>`
> +checking are sufficiently complicated to warrant separate sections.
> +
> +Post-Repair Reverification
> +``````````````````````````
> +
> +After performing a repair, the checking code is run a second time to validate
> +the new structure, and the results of the health assessment are recorded
> +internally and returned to the calling process.
> +This step is critical for enabling system administrators to monitor the status
> +of the filesystem and the progress of any repairs.
> +For developers, it is a useful means to judge the efficacy of error detection
> +and correction in the online and offline checking tools.
> diff --git a/Documentation/filesystems/xfs-self-describing-metadata.rst b/Documentation/filesystems/xfs-self-describing-metadata.rst
> index b79dbf36dc94..a10c4ae6955e 100644
> --- a/Documentation/filesystems/xfs-self-describing-metadata.rst
> +++ b/Documentation/filesystems/xfs-self-describing-metadata.rst
> @@ -1,4 +1,5 @@
>  .. SPDX-License-Identifier: GPL-2.0
> +.. _xfs_self_describing_metadata:
>  
>  ============================
>  XFS Self Describing Metadata
>
Darrick J. Wong Feb. 2, 2023, 7:04 p.m. UTC | #2
On Sat, Jan 21, 2023 at 01:38:33AM +0000, Allison Henderson wrote:
> On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Begin the fifth chapter of the online fsck design documentation, where
> > we discuss the details of the data structures and algorithms used by the
> > kernel to examine filesystem metadata and cross-reference it around the
> > filesystem.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  .../filesystems/xfs-online-fsck-design.rst         |  579 ++++++++++++++++++++
> >  .../filesystems/xfs-self-describing-metadata.rst   |    1 
> >  2 files changed, 580 insertions(+)
> > 
> > 
> > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
> > index 42e82971e036..f45bf97fa9c4 100644
> > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > @@ -864,3 +864,582 @@ Proposed patchsets include
> >  and
> >  `preservation of sickness info during memory reclaim
> >  <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting>`_.
> > +
> > +5. Kernel Algorithms and Data Structures
> > +========================================
> > +
> > +This section discusses the key algorithms and data structures of the kernel
> > +code that provide the ability to check and repair metadata while the system
> > +is running.
> > +The first chapters in this section reveal the pieces that provide the
> > +foundation for checking metadata.
> > +The remainder of this section presents the mechanisms through which XFS
> > +regenerates itself.
> > +
> > +Self Describing Metadata
> > +------------------------
> > +
> > +Starting with XFS version 5 in 2012, XFS updated the format of nearly every
> > +ondisk block header to record a magic number, a checksum, a universally
> > +"unique" identifier (UUID), an owner code, the ondisk address of the block,
> > +and a log sequence number.
> > +When loading a block buffer from disk, the magic number, UUID, owner, and
> > +ondisk address confirm that the retrieved block matches the specific owner of
> > +the current filesystem, and that the information contained in the block is
> > +supposed to be found at the ondisk address.
> > +The first three components enable checking tools to disregard alleged metadata
> > +that doesn't belong to the filesystem, and the fourth component enables the
> > +filesystem to detect lost writes.
> Add...
> 
> "When ever a file system operation modifies a block, the change is
> submitted to the journal as a transaction.  The journal then processes
> these transactions marking them done once they are safely committed to
> the disk"

Ok, I'll add that transition.  Though I'll s/journal/log/ since this is
xfs. :)

> At this point we haven't talked much at all about transactions or logs,
> and we've just barely begun to cover blocks.  I think you at least want
> a quick blip to describe the relation of these two things, or it may
> not be clear why we suddenly jumped into logs.

Point taken.  Thanks for the suggestion.

> > +
> > +The logging code maintains the checksum and the log sequence number of the last
> > +transactional update.
> > +Checksums are useful for detecting torn writes and other mischief between the
> "Checksums (or crc's) are useful for detecting incomplete or torn
> writes as well as other discrepancies..."

Checksums are a general concept, whereas CRCs denote a particular family
of checksums.  The statement would still apply even if we used a
different family of functions (e.g. erasure codes, cryptographic hash
functions) instead of crc32c.

I will, however, avoid the undefined term 'mischief'.  Thanks for the
correction.

"Checksums are useful for detecting torn writes and other discrepancies
that can be introduced between the computer and its storage devices."

> > +computer and its storage devices.
> > +Sequence number tracking enables log recovery to avoid applying out of date
> > +log updates to the filesystem.
> > +
> > +These two features improve overall runtime resiliency by providing a means for
> > +the filesystem to detect obvious corruption when reading metadata blocks from
> > +disk, but these buffer verifiers cannot provide any consistency checking
> > +between metadata structures.
> > +
> > +For more information, please see the documentation for
> > +Documentation/filesystems/xfs-self-describing-metadata.rst
> > +
> > +Reverse Mapping
> > +---------------
> > +
> > +The original design of XFS (circa 1993) is an improvement upon 1980s Unix
> > +filesystem design.
> > +In those days, storage density was expensive, CPU time was scarce, and
> > +excessive seek time could kill performance.
> > +For performance reasons, filesystem authors were reluctant to add redundancy to
> > +the filesystem, even at the cost of data integrity.
> > +Filesystem designers in the early 21st century chose different strategies to
> > +increase internal redundancy -- either storing nearly identical copies of
> > +metadata, or more space-efficient techniques such as erasure coding.
> "such as erasure coding which may encode sections of the data with
> redundant symbols and in more than one location"
> 
> That ties it into the next line.  If you go on to talk about a term you
> have not previously defined, i think you want to either define it
> quickly or just drop it all together.  Right now your goal is to just
> give the reader context, so you want it to move quickly.

How about I shorten it to:

"...or more space-efficient encoding techniques." ?

and end the paragraph there?

> > +Obvious corruptions are typically repaired by copying replicas or
> > +reconstructing from codes.
> > +
> I think I would have just jumped straight from xfs history to modern
> xfs...
> > +For XFS, a different redundancy strategy was chosen to modernize the design:
> > +a secondary space usage index that maps allocated disk extents back to their
> > +owners.
> > +By adding a new index, the filesystem retains most of its ability to scale
> > +well to heavily threaded workloads involving large datasets, since the primary
> > +file metadata (the directory tree, the file block map, and the allocation
> > +groups) remain unchanged.
> > 
> 
> > +Although the reverse-mapping feature increases overhead costs for space
> > +mapping activities just like any other system that improves redundancy, it
> "Like any system that improves redundancy, the reverse-mapping feature
> increases overhead costs for space mapping activities. However, it..."

I like this better.  These two sentences have been changed to read:

"Like any system that improves redundancy, the reverse-mapping feature
increases overhead costs for space mapping activities.  However, it has
two critical advantages: first, the reverse index is key to enabling
online fsck and other requested functionality such as free space
defragmentation, better media failure reporting, and filesystem
shrinking."

> > +has two critical advantages: first, the reverse index is key to enabling online
> > +fsck and other requested functionality such as filesystem reorganization,
> > +better media failure reporting, and shrinking.
> > +Second, the different ondisk storage format of the reverse mapping btree
> > +defeats device-level deduplication, because the filesystem requires real
> > +redundancy.
> > +
> > +A criticism of adding the secondary index is that it does nothing to improve
> > +the robustness of user data storage itself.
> > +This is a valid point, but adding a new index for file data block checksums
> > +increases write amplification and turns data overwrites into copy-writes, which
> > +age the filesystem prematurely.
> > +In keeping with thirty years of precedent, users who want file data integrity
> > +can supply as powerful a solution as they require.
> > +As for metadata, the complexity of adding a new secondary index of space usage
> > +is much less than adding volume management and storage device mirroring to XFS
> > +itself.
> > +Perfection of RAID and volume management is best left to existing layers in
> > +the kernel.
> I think I would cull the entire above paragraph.  rmap, crc and raid
> all have very different points of redundancy, so criticism that an
> apple is not an orange or vice versa just feels like a shortsighted
> comparison that's probably more of a distraction than anything.
> 
> Sometimes it feels like this document kinda gets off into tangents
> like it's preemptively trying to position itself for an argument
> that hasn't happened yet.

It does!  Each of the many tangents that you've pointed out is a
reaction to some discussion that we've had on the list, or at an
LSF, or <cough> fs nerds sniping on social media.  The reason I
capture all of these offtopic arguments is to discourage people from
wasting time rehashing discussions that were settled long ago.

Admittedly, that is a very defensive reaction on my part...

> But I think it has the effect of pulling the
> reader's attention off topic into an argument they never thought to
> consider in the first place.  The topic of this section is to explain
> what rmap is.  So let's stay on topic and finish laying out that ground
> work first before getting into how it compares to other solutions.

...and you're right to point out that mentioning these things is
distracting and provides fuel to reignite a flamewar.  At the same time,
I think there's value in identifying the roads not taken, and why.

What if I turned these tangents into explicitly labelled sidebars?
Would that help readers who want to stick to the topic?

> > +
> > +The information captured in a reverse space mapping record is as follows:
> > +
> > +.. code-block:: c
> > +
> > +       struct xfs_rmap_irec {
> > +           xfs_agblock_t    rm_startblock;   /* extent start block */
> > +           xfs_extlen_t     rm_blockcount;   /* extent length */
> > +           uint64_t         rm_owner;        /* extent owner */
> > +           uint64_t         rm_offset;       /* offset within the owner */
> > +           unsigned int     rm_flags;        /* state flags */
> > +       };
> > +
> > +The first two fields capture the location and size of the physical space,
> > +in units of filesystem blocks.
> > +The owner field tells scrub which metadata structure or file inode has been
> > +assigned this space.
> > +For space allocated to files, the offset field tells scrub where the space was
> > +mapped within the file fork.
> > +Finally, the flags field provides extra information about the space usage --
> > +is this an attribute fork extent?  A file mapping btree extent?  Or an
> > +unwritten data extent?
> > +
> > +Online filesystem checking judges the consistency of each primary metadata
> > +record by comparing its information against all other space indices.
> > +The reverse mapping index plays a key role in the consistency checking process
> > +because it contains a centralized alternate copy of all space allocation
> > +information.
> > +Program runtime and ease of resource acquisition are the only real limits to
> > +what online checking can consult.
> > +For example, a file data extent mapping can be checked against:
> > +
> > +* The absence of an entry in the free space information.
> > +* The absence of an entry in the inode index.
> > +* The absence of an entry in the reference count data if the file is not
> > +  marked as having shared extents.
> > +* The correspondence of an entry in the reverse mapping information.
> > +
> > +A key observation here is that only the reverse mapping can provide a positive
> > +affirmation of correctness if the primary metadata is in doubt.
> if any of the above metadata is in doubt...

Fixed.

> > +The checking code for most primary metadata follows a path similar to the
> > +one outlined above.
> > +
> > +A second observation to make about this secondary index is that proving its
> > +consistency with the primary metadata is difficult.
> 
> > +Demonstrating that a given reverse mapping record exactly corresponds to the
> > +primary space metadata involves a full scan of all primary space metadata,
> > +which is very time intensive.
> "But why?" Wonders the reader. Just jump into an example:
> 
> "In order to verify that an rmap extent does not incorrectly over lap
> with another record, we would need a full scan of all the other
> records, which is time intensive."

I want to shorten it even further:

"Validating that reverse mapping records are correct requires a full
scan of all primary space metadata, which is very time intensive."

> 
> ?
> 
> And then the below is a separate observation, right?

Right.

> > +Scanning activity for online fsck can only use non-blocking lock acquisition
> > +primitives if the locking order is not the regular order as used by the rest of
> > +the filesystem.
> Lastly, it should be noted that most file system operations tend to
> lock primary metadata before locking the secondary metadata.

This isn't accurate -- metadata structures don't have separate locks.
So it's not true to say that we lock primary or secondary metadata.

We /can/ say that file operations lock the inode, then the AGI, then the
AGF; or that directory operations lock the parent and child ILOCKs in
inumber order; and that if scrub wants to take locks in any other order,
it can only do that via trylocks and backoff.
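
In code form, that pattern might look like the sketch below -- a
simplified illustration of trylock-and-back-off, not the actual scrub
code:

    /* Scrub may be acquiring locks out of the usual order, so it must
     * not block; on failure the caller unwinds whatever it already
     * holds and retries the whole operation later. */
    static int
    example_trylock_inode(struct xfs_inode *ip)
    {
            if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
                    return -EAGAIN;
            return 0;
    }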

> This
> means that scanning operations that acquire the secondary metadata
> first may need to yield the secondary lock to filesystem operations
> that have already acquired the primary lock. 
> 
> ?
> 
> > +This means that forward progress during this part of a scan of the reverse
> > +mapping data cannot be guaranteed if system load is especially heavy.
> > +Therefore, it is not practical for online check to detect reverse mapping
> > +records that lack a counterpart in the primary metadata.
> Such as <quick list / quick example>
> 
> > +Instead, scrub relies on rigorous cross-referencing during the primary space
> > +mapping structure checks.

I've converted this section into a bullet list:

"There are several observations to make about reverse mapping indices:

"1. Reverse mappings can provide a positive affirmation of correctness if
any of the above primary metadata are in doubt.  The checking code for
most primary metadata follows a path similar to the one outlined above.

"2. Proving the consistency of secondary metadata with the primary
metadata is difficult because that requires a full scan of all primary
space metadata, which is very time intensive.  For example, checking a
reverse mapping record for a file extent mapping btree block requires
locking the file and searching the entire btree to confirm the block.
Instead, scrub relies on rigorous cross-referencing during the primary
space mapping structure checks.

"3. Consistency scans must use non-blocking lock acquisition primitives
if the required locking order is not the same order used by regular
filesystem operations.  This means that forward progress during this
part of a scan of the reverse mapping data cannot be guaranteed if
system load is heavy."

> > +
> 
> The below paragraph sounds like a re-cap?
> 
> "So to recap, reverse mappings also...."

Yep.

> > +Reverse mappings also play a key role in reconstruction of primary metadata.
> > +The secondary information is general enough for online repair to synthesize a
> > +complete copy of any primary space management metadata by locking that
> > +resource, querying all reverse mapping indices looking for records matching
> > +the relevant resource, and transforming the mapping into an appropriate format.
> > +The details of how these records are staged, written to disk, and committed
> > +into the filesystem are covered in subsequent sections.
> I also think the section would be ok if you were to trim off this last
> paragraph too.

Hm.  I still want to set up the expectation that there's more to come.
How about a brief two-sentence transition paragraph:

"In summary, reverse mappings play a key role in reconstruction of
primary metadata.  The details of how these records are staged, written
to disk, and committed into the filesystem are covered in subsequent
sections."

> 
> > +
> > +Checking and Cross-Referencing
> > +------------------------------
> > +
> > +The first step of checking a metadata structure is to examine every record
> > +contained within the structure and its relationship with the rest of the
> > +system.
> > +XFS contains multiple layers of checking to try to prevent inconsistent
> > +metadata from wreaking havoc on the system.
> > +Each of these layers contributes information that helps the kernel to make
> > +five decisions about the health of a metadata structure:
> > +
> > +- Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``) ?
> > +- Is this structure inconsistent with the rest of the system
> > +  (``XFS_SCRUB_OFLAG_XCORRUPT``) ?
> > +- Is there so much damage around the filesystem that cross-referencing is not
> > +  possible (``XFS_SCRUB_OFLAG_XFAIL``) ?
> > +- Can the structure be optimized to improve performance or reduce the size of
> > +  metadata (``XFS_SCRUB_OFLAG_PREEN``) ?
> > +- Does the structure contain data that is not inconsistent but deserves review
> > +  by the system administrator (``XFS_SCRUB_OFLAG_WARNING``) ?
> > +
> > +The following sections describe how the metadata scrubbing process works.
> > +
> > +Metadata Buffer Verification
> > +````````````````````````````
> > +
> > +The lowest layer of metadata protection in XFS is the metadata verifiers built
> > +into the buffer cache.
> > +These functions perform inexpensive internal consistency checking of the block
> > +itself, and answer these questions:
> > +
> > +- Does the block belong to this filesystem?
> > +
> > +- Does the block belong to the structure that asked for the read?
> > +  This assumes that metadata blocks only have one owner, which is always true
> > +  in XFS.
> > +
> > +- Is the type of data stored in the block within a reasonable range of what
> > +  scrub is expecting?
> > +
> > +- Does the physical location of the block match the location it was read from?
> > +
> > +- Does the block checksum match the data?
> > +
> > +The scope of the protections here is very limited -- verifiers can only
> > +establish that the filesystem code is reasonably free of gross corruption bugs
> > +and that the storage system is reasonably competent at retrieval.
> > +Corruption problems observed at runtime cause the generation of health reports,
> > +failed system calls, and in the extreme case, filesystem shutdowns if the
> > +corrupt metadata forces the cancellation of a dirty transaction.
> > +
> > +Every online fsck scrubbing function is expected to read every ondisk metadata
> > +block of a structure in the course of checking the structure.
> > +Corruption problems observed during a check are immediately reported to
> > +userspace as corruption; during a cross-reference, they are reported as a
> > +failure to cross-reference once the full examination is complete.
> > +Reads satisfied by a buffer already in cache (and hence already verified)
> > +bypass these checks.
> > +
> > +Internal Consistency Checks
> > +```````````````````````````
> > +
> > +The next higher level of metadata protection is the internal record
> "After the buffer cache, the next level of metadata protection is..."

Changed.  I'll do the same to the next section as well.

> > +verification code built into the filesystem.
> 
> > +These checks are split between the buffer verifiers, the in-filesystem users of
> > +the buffer cache, and the scrub code itself, depending on the amount of higher
> > +level context required.
> > +The scope of checking is still internal to the block.
> > +For performance reasons, regular code may skip some of these checks unless
> > +debugging is enabled or a write is about to occur.
> > +Scrub functions, of course, must check all possible problems.
> I'd put this chunk after the list below.
> 
> > +Either way, these higher level checking functions answer these questions:
> Then this becomes:
> "These higher level checking functions..."

Done.

> > +
> > +- Does the type of data stored in the block match what scrub is expecting?
> > +
> > +- Does the block belong to the owning structure that asked for the read?
> > +
> > +- If the block contains records, do the records fit within the block?
> > +
> > +- If the block tracks internal free space information, is it consistent with
> > +  the record areas?
> > +
> > +- Are the records contained inside the block free of obvious corruptions?
> > +
> > +Record checks in this category are more rigorous and more time-intensive.
> > +For example, block pointers and inumbers are checked to ensure that they point
> > +within the dynamically allocated parts of an allocation group and within
> > +the filesystem.
> > +Names are checked for invalid characters, and flags are checked for invalid
> > +combinations.
> > +Other record attributes are checked for sensible values.
> > +Btree records spanning an interval of the btree keyspace are checked for
> > +correct order and lack of mergeability (except for file fork mappings).
> > +
> > +Validation of Userspace-Controlled Record Attributes
> > +````````````````````````````````````````````````````
> > +
> > +Various pieces of filesystem metadata are directly controlled by userspace.
> > +Because of this nature, validation work cannot be more precise than checking
> > +that a value is within the possible range.
> > +These fields include:
> > +
> > +- Superblock fields controlled by mount options
> > +- Filesystem labels
> > +- File timestamps
> > +- File permissions
> > +- File size
> > +- File flags
> > +- Names present in directory entries, extended attribute keys, and filesystem
> > +  labels
> > +- Extended attribute key namespaces
> > +- Extended attribute values
> > +- File data block contents
> > +- Quota limits
> > +- Quota timer expiration (if resource usage exceeds the soft limit)
> > +
> > +Cross-Referencing Space Metadata
> > +````````````````````````````````
> > +
> > +The next higher level of checking is cross-referencing records between metadata
> 
> I kinda like the list first so that the reader has an idea of what
> these checks are before getting into discussion about them.  It just
> makes it a little more obvious as to why it's "prohibitively expensive"
> or "dependent on the context of the structure" after having just looked
> at it.

<nod>

> The rest looks good from here.

Woot.  Onto the next reply! :)

--D

> Allison
> 
> > +structures.
> > +For regular runtime code, the cost of these checks is considered to be
> > +prohibitively expensive, but as scrub is dedicated to rooting out
> > +inconsistencies, it must pursue all avenues of inquiry.
> > +The exact set of cross-referencing is highly dependent on the context of the
> > +data structure being checked.
> > +
> > +The XFS btree code has keyspace scanning functions that online fsck uses to
> > +cross reference one structure with another.
> > +Specifically, scrub can scan the key space of an index to determine if that
> > +keyspace is fully, sparsely, or not at all mapped to records.
> > +For the reverse mapping btree, it is possible to mask parts of the key for the
> > +purposes of performing a keyspace scan so that scrub can decide if the rmap
> > +btree contains records mapping a certain extent of physical space without the
> > +sparseness of the rest of the rmap keyspace getting in the way.
> > +
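> > +The verdict of such a keyspace scan can be represented with three
> > +states, roughly as follows (placeholder names, not the kernel's
> > +actual constants):
> > +
> > +.. code-block:: c
> > +
> > +       /* Illustrative outcomes of a btree keyspace scan. */
> > +       enum example_keyfill {
> > +           EXAMPLE_KEYFILL_EMPTY,   /* none of the keyspace is mapped */
> > +           EXAMPLE_KEYFILL_SPARSE,  /* only part of it is mapped */
> > +           EXAMPLE_KEYFILL_FULL,    /* the entire keyspace is mapped */
> > +       };
> > +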
> > +Btree blocks undergo the following checks before cross-referencing:
> > +
> > +- Does the type of data stored in the block match what scrub is
> > +  expecting?
> > +
> > +- Does the block belong to the owning structure that asked for the
> > +  read?
> > +
> > +- Do the records fit within the block?
> > +
> > +- Are the records contained inside the block free of obvious
> > +  corruptions?
> > +
> > +- Are the name hashes in the correct order?
> > +
> > +- Do node pointers within the btree point to valid block addresses
> > +  for the type of btree?
> > +
> > +- Do child pointers point towards the leaves?
> > +
> > +- Do sibling pointers point across the same level?
> > +
> > +- For each node block record, does the record key accurately reflect
> > +  the contents of the child block?
> > +
> > +Space allocation records are cross-referenced as follows:
> > +
> > +1. Any space mentioned by any metadata structure is cross-referenced
> > +   as follows:
> > +
> > +   - Does the reverse mapping index list only the appropriate owner
> > +     as the owner of each block?
> > +
> > +   - Are none of the blocks claimed as free space?
> > +
> > +   - If these aren't file data blocks, are none of the blocks claimed
> > +     as space shared by different owners?
> > +
> > +2. Btree blocks are cross-referenced as follows:
> > +
> > +   - Everything in class 1 above.
> > +
> > +   - If there's a parent node block, do the keys listed for this
> > +     block match the keyspace of this block?
> > +
> > +   - Do the sibling pointers point to valid blocks?  Of the same
> > +     level?
> > +
> > +   - Do the child pointers point to valid blocks?  Of the next level
> > +     down?
> > +
> > +3. Free space btree records are cross-referenced as follows:
> > +
> > +   - Everything in classes 1 and 2 above.
> > +
> > +   - Does the reverse mapping index list no owners of this space?
> > +
> > +   - Is this space not claimed by the inode index for inodes?
> > +
> > +   - Is it not mentioned by the reference count index?
> > +
> > +   - Is there a matching record in the other free space btree?
> > +
> > +4. Inode btree records are cross-referenced as follows:
> > +
> > +   - Everything in classes 1 and 2 above.
> > +
> > +   - Is there a matching record in the free inode btree?
> > +
> > +   - Do cleared bits in the holemask correspond with inode clusters?
> > +
> > +   - Do set bits in the freemask correspond with inode records with
> > +     zero link count?
> > +
> > +5. Inode records are cross-referenced as follows:
> > +
> > +   - Everything in class 1.
> > +
> > +   - Do all the fields that summarize information about the file
> > +     forks actually match those forks?
> > +
> > +   - Does each inode with zero link count correspond to a record in
> > +     the free inode btree?
> > +
> > +6. File fork space mapping records are cross-referenced as follows:
> > +
> > +   - Everything in classes 1 and 2 above.
> > +
> > +   - Is this space not mentioned by the inode btrees?
> > +
> > +   - If this is a CoW fork mapping, does it correspond to a CoW entry
> > +     in the reference count btree?
> > +
> > +7. Reference count records are cross-referenced as follows:
> > +
> > +   - Everything in classes 1 and 2 above.
> > +
> > +   - Within the space subkeyspace of the rmap btree (that is to say,
> > +     all records mapped to a particular space extent and ignoring the
> > +     owner info), are there the same number of reverse mapping
> > +     records for each block as the reference count record claims?
> > +
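> > +As a rough sketch, the class 1 checks above could be composed like
> > +this, with hypothetical example_* prototypes standing in for the real
> > +scrub cross-referencing helpers:
> > +
> > +.. code-block:: c
> > +
> > +       /* Hypothetical helpers; the real ones live in fs/xfs/scrub/. */
> > +       bool example_rmap_has_only_owner(xfs_agblock_t agbno,
> > +                                        xfs_extlen_t len, uint64_t owner);
> > +       bool example_is_free_space(xfs_agblock_t agbno, xfs_extlen_t len);
> > +       bool example_is_shared(xfs_agblock_t agbno, xfs_extlen_t len);
> > +
> > +       static bool
> > +       example_xref_owned_space(xfs_agblock_t agbno, xfs_extlen_t len,
> > +                                uint64_t owner, bool is_file_data)
> > +       {
> > +           if (!example_rmap_has_only_owner(agbno, len, owner))
> > +               return false;       /* rmap must list only this owner */
> > +           if (example_is_free_space(agbno, len))
> > +               return false;       /* must not be claimed as free */
> > +           if (!is_file_data && example_is_shared(agbno, len))
> > +               return false;       /* metadata must not be shared */
> > +           return true;
> > +       }
> > +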
> > +Proposed patchsets are the series to find gaps in
> > +`refcount btree
> > +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-gaps>`_,
> > +`inode btree
> > +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gaps>`_, and
> > +`rmap btree
> > +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-gaps>`_ records;
> > +to find
> > +`mergeable records
> > +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-records>`_;
> > +and to
> > +`improve cross referencing with rmap
> > +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-checking>`_
> > +before starting a repair.
> > +
> > +Checking Extended Attributes
> > +````````````````````````````
> > +
> > +Extended attributes implement a key-value store that enables
> > +fragments of data to be attached to any file.
> > +Both the kernel and userspace can access the keys and values, subject
> > +to namespace and privilege restrictions.
> > +Most typically these fragments are metadata about the file --
> > +origins, security contexts, user-supplied labels, indexing
> > +information, etc.
> > +
> > +Names can be as long as 255 bytes and can exist in several different
> > +namespaces.
> > +Values can be as large as 64KB.
> > +A file's extended attributes are stored in blocks mapped by the attr
> > +fork.
> > +The mappings point to leaf blocks, remote value blocks, or dabtree
> > +blocks.
> > +Block 0 in the attribute fork is always the top of the structure, but
> > +otherwise each of the three types of blocks can be found at any
> > +offset in the attr fork.
> > +Leaf blocks contain attribute key records that point to the name and
> > +the value.
> > +Names are always stored elsewhere in the same leaf block.
> > +Values that are less than 3/4 the size of a filesystem block are also
> > +stored elsewhere in the same leaf block.
> > +Remote value blocks contain values that are too large to fit inside a
> > +leaf.
> > +If the leaf information exceeds a single filesystem block, a dabtree
> > +(also rooted at block 0) is created to map hashes of the attribute
> > +names to leaf blocks in the attr fork.
> > +
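> > +For instance, the decision between storing a value in the leaf or in
> > +a remote block comes down to a size comparison along these lines; the
> > +kernel computes the exact threshold from the leaf geometry, so this
> > +is only an illustrative sketch:
> > +
> > +.. code-block:: c
> > +
> > +       /* Illustrative local-versus-remote test for an attr value. */
> > +       static inline bool
> > +       example_attr_value_is_remote(unsigned int valuelen,
> > +                                    unsigned int blocksize)
> > +       {
> > +           return valuelen > blocksize * 3 / 4;
> > +       }
> > +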
> > +Checking an extended attribute structure is not so straightforward
> > +due to the lack of separation between attr blocks and index blocks.
> > +Scrub must read each block mapped by the attr fork and ignore the
> > +non-leaf blocks:
> > +
> > +1. Walk the dabtree in the attr fork (if present) to ensure that
> > +   there are no irregularities in the blocks or dabtree mappings that
> > +   do not point to attr leaf blocks.
> > +
> > +2. Walk the blocks of the attr fork looking for leaf blocks.
> > +   For each entry inside a leaf:
> > +
> > +   a. Validate that the name does not contain invalid characters.
> > +
> > +   b. Read the attr value.
> > +      This performs a named lookup of the attr name to ensure the
> > +      correctness of the dabtree.
> > +      If the value is stored in a remote block, this also validates
> > +      the integrity of the remote value block.
> > +
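> > +Step 2a might be rendered like this sketch, which treats an attr
> > +name as valid if it is non-empty, at most 255 bytes, and free of NUL
> > +bytes (a placeholder for the real name check):
> > +
> > +.. code-block:: c
> > +
> > +       /* Illustrative attr name check. */
> > +       static bool
> > +       example_attr_name_valid(const unsigned char *name,
> > +                               unsigned int namelen)
> > +       {
> > +           unsigned int i;
> > +
> > +           if (namelen == 0 || namelen > 255)
> > +               return false;
> > +           for (i = 0; i < namelen; i++)
> > +               if (name[i] == 0)
> > +                   return false;
> > +           return true;
> > +       }
> > +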
> > +Checking and Cross-Referencing Directories
> > +``````````````````````````````````````````
> > +
> > +The filesystem directory tree is a directed acyclic graph structure,
> > +with files constituting the nodes, and directory entries (dirents)
> > +constituting the edges.
> > +Directories are a special type of file containing a set of mappings
> > +from a 255-byte sequence (name) to an inumber.
> > +These are called directory entries, or dirents for short.
> > +Each directory file must have exactly one directory pointing to the
> > +file.
> > +A root directory points to itself.
> > +Directory entries point to files of any type.
> > +Each non-directory file may have multiple directories pointing to it.
> > +
> > +In XFS, directories are implemented as a file containing up to three
> > +32GB partitions.
> > +The first partition contains directory entry data blocks.
> > +Each data block contains variable-sized records associating a
> > +user-provided name with an inumber and, optionally, a file type.
> > +If the directory entry data grows beyond one block, the second
> > +partition (which exists as post-EOF extents) is populated with a
> > +block containing free space information and an index that maps hashes
> > +of the dirent names to directory data blocks in the first partition.
> > +This makes directory name lookups very fast.
> > +If this second partition grows beyond one block, the third partition
> > +is populated with a linear array of free space information for faster
> > +expansions.
> > +If the free space has been separated and the second partition grows
> > +again beyond one block, then a dabtree is used to map hashes of
> > +dirent names to directory data blocks.
> > +
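> > +Expressed as illustrative constants, the three partitions sit at
> > +fixed 32GB offsets within the directory file; the kernel derives the
> > +actual values from the directory geometry:
> > +
> > +.. code-block:: c
> > +
> > +       /* Illustrative partition offsets within a directory file. */
> > +       #define EXAMPLE_DIR_SPACE_SIZE   (1ULL << 35)    /* 32 GiB */
> > +       #define EXAMPLE_DIR_DATA_OFFSET  (0ULL * EXAMPLE_DIR_SPACE_SIZE)
> > +       #define EXAMPLE_DIR_LEAF_OFFSET  (1ULL * EXAMPLE_DIR_SPACE_SIZE)
> > +       #define EXAMPLE_DIR_FREE_OFFSET  (2ULL * EXAMPLE_DIR_SPACE_SIZE)
> > +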
> > +Checking a directory is pretty straightforward:
> > +
> > +1. Walk the dabtree in the second partition (if present) to ensure
> > +   that there are no irregularities in the blocks or dabtree mappings
> > +   that do not point to dirent blocks.
> > +
> > +2. Walk the blocks of the first partition looking for directory
> > +   entries.
> > +   Each dirent is checked as follows:
> > +
> > +   a. Does the name contain no invalid characters?
> > +
> > +   b. Does the inumber correspond to an actual, allocated inode?
> > +
> > +   c. Does the child inode have a nonzero link count?
> > +
> > +   d. If a file type is included in the dirent, does it match the
> > +      type of the inode?
> > +
> > +   e. If the child is a subdirectory, does the child's dotdot pointer
> > +      point back to the parent?
> > +
> > +   f. If the directory has a second partition, perform a named lookup
> > +      of the dirent name to ensure the correctness of the dabtree.
> > +
> > +3. Walk the free space list in the third partition (if present) to
> > +   ensure that the free spaces it describes are really unused.
> > +
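> > +Step 2a resembles the attr name check sketched earlier, except that
> > +a dirent name also may not contain a '/' character; roughly:
> > +
> > +.. code-block:: c
> > +
> > +       /* Illustrative dirent name check; a placeholder sketch. */
> > +       static bool
> > +       example_dir_name_valid(const unsigned char *name,
> > +                              unsigned int namelen)
> > +       {
> > +           unsigned int i;
> > +
> > +           if (namelen == 0 || namelen > 255)
> > +               return false;
> > +           for (i = 0; i < namelen; i++)
> > +               if (name[i] == 0 || name[i] == '/')
> > +                   return false;
> > +           return true;
> > +       }
> > +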
> > +Checking operations involving :ref:`parents <dirparent>` and
> > +:ref:`file link counts <nlinks>` are discussed in more detail in
> > +later sections.
> > +
> > +Checking Directory/Attribute Btrees
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +As stated in previous sections, the directory/attribute btree
> > +(dabtree) index maps user-provided names to improve lookup times by
> > +avoiding linear scans.
> > +Internally, it maps a 32-bit hash of the name to a block offset
> > +within the appropriate file fork.
> > +
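> > +The hash in question is a rotating 32-bit mix of the name bytes; a
> > +simplified rendition of the kernel's ``xfs_da_hashname`` follows:
> > +
> > +.. code-block:: c
> > +
> > +       static inline uint32_t example_rol32(uint32_t x, unsigned int n)
> > +       {
> > +           return (x << n) | (x >> (32 - n));
> > +       }
> > +
> > +       /* Simplified rendition of the dabtree name hash. */
> > +       static uint32_t
> > +       example_da_hashname(const uint8_t *name, int namelen)
> > +       {
> > +           uint32_t hash = 0;
> > +
> > +           for (; namelen >= 4; namelen -= 4, name += 4)
> > +               hash = (name[0] << 21) ^ (name[1] << 14) ^
> > +                      (name[2] << 7) ^ name[3] ^
> > +                      example_rol32(hash, 28);
> > +           switch (namelen) {
> > +           case 3:
> > +               return (name[0] << 14) ^ (name[1] << 7) ^ name[2] ^
> > +                      example_rol32(hash, 21);
> > +           case 2:
> > +               return (name[0] << 7) ^ name[1] ^ example_rol32(hash, 14);
> > +           case 1:
> > +               return name[0] ^ example_rol32(hash, 7);
> > +           default:
> > +               return hash;
> > +           }
> > +       }
> > +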
> > +The internal structure of a dabtree closely resembles the btrees
> > +that record fixed-size metadata records -- each dabtree block
> > +contains a magic number, a checksum, sibling pointers, a UUID, a tree
> > +level, and a log sequence number.
> > +The format of leaf and node records is the same -- each entry points
> > +to the next level down in the hierarchy, with dabtree node records
> > +pointing to dabtree leaf blocks, and dabtree leaf records pointing to
> > +non-dabtree blocks elsewhere in the fork.
> > +
> > +Checking and cross-referencing the dabtree is very similar to what
> > +is done for space btrees:
> > +
> > +- Does the type of data stored in the block match what scrub is
> > +  expecting?
> > +
> > +- Does the block belong to the owning structure that asked for the
> > +  read?
> > +
> > +- Do the records fit within the block?
> > +
> > +- Are the records contained inside the block free of obvious
> > +  corruptions?
> > +
> > +- Are the name hashes in the correct order?
> > +
> > +- Do node pointers within the dabtree point to valid fork offsets for
> > +  dabtree blocks?
> > +
> > +- Do leaf pointers within the dabtree point to valid fork offsets for
> > +  directory or attr leaf blocks?
> > +
> > +- Do child pointers point towards the leaves?
> > +
> > +- Do sibling pointers point across the same level?
> > +
> > +- For each dabtree node record, does the record key accurately
> > +  reflect the contents of the child dabtree block?
> > +
> > +- For each dabtree leaf record, does the record key accurately
> > +  reflect the contents of the directory or attr block?
> > +
> > +Cross-Referencing Summary Counters
> > +``````````````````````````````````
> > +
> > +XFS maintains three classes of summary counters: available
> > +resources, quota resource usage, and file link counts.
> > +
> > +In theory, the amount of available resources (data blocks, inodes,
> > +realtime extents) can be found by walking the entire filesystem.
> > +This would make for very slow reporting, so a transactional
> > +filesystem can maintain summaries of this information in the
> > +superblock.
> > +Cross-referencing these values against the filesystem metadata should
> > +be a simple matter of walking the free space and inode metadata in
> > +each AG and the realtime bitmap, but there are complications that
> > +will be discussed in :ref:`more detail <fscounters>` later.
> > +
> > +:ref:`Quota usage <quotacheck>` and :ref:`file link count <nlinks>`
> > +checking are sufficiently complicated to warrant separate sections.
> > +
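> > +In its simplest form, cross-referencing a summary counter is just a
> > +walk-and-sum comparison, as in this sketch:
> > +
> > +.. code-block:: c
> > +
> > +       /*
> > +        * Illustrative check of the free data block counter against
> > +        * per-AG totals gathered by walking the free space metadata.
> > +        */
> > +       static bool
> > +       example_fdblocks_match(uint64_t sb_fdblocks,
> > +                              const uint64_t *ag_freeblks,
> > +                              unsigned int agcount)
> > +       {
> > +           uint64_t total = 0;
> > +           unsigned int agno;
> > +
> > +           for (agno = 0; agno < agcount; agno++)
> > +               total += ag_freeblks[agno];
> > +           return total == sb_fdblocks;
> > +       }
> > +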
> > +Post-Repair Reverification
> > +``````````````````````````
> > +
> > +After performing a repair, the checking code is run a second time to
> > +validate the new structure, and the results of the health assessment
> > +are recorded internally and returned to the calling process.
> > +This step is critical for enabling system administrators to monitor
> > +the status of the filesystem and the progress of any repairs.
> > +For developers, it is a useful means to judge the efficacy of error
> > +detection and correction in the online and offline checking tools.
> > diff --git a/Documentation/filesystems/xfs-self-describing-metadata.rst b/Documentation/filesystems/xfs-self-describing-metadata.rst
> > index b79dbf36dc94..a10c4ae6955e 100644
> > --- a/Documentation/filesystems/xfs-self-describing-metadata.rst
> > +++ b/Documentation/filesystems/xfs-self-describing-metadata.rst
> > @@ -1,4 +1,5 @@
> >  .. SPDX-License-Identifier: GPL-2.0
> > +.. _xfs_self_describing_metadata:
> >  
> >  ============================
> >  XFS Self Describing Metadata
> > 
>
Allison Henderson Feb. 9, 2023, 5:41 a.m. UTC | #3
On Thu, 2023-02-02 at 11:04 -0800, Darrick J. Wong wrote:
> On Sat, Jan 21, 2023 at 01:38:33AM +0000, Allison Henderson wrote:
> > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Begin the fifth chapter of the online fsck design documentation,
> > > where we discuss the details of the data structures and algorithms
> > > used by the kernel to examine filesystem metadata and
> > > cross-reference it around the filesystem.
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  .../filesystems/xfs-online-fsck-design.rst         |  579
> > > ++++++++++++++++++++
> > >  .../filesystems/xfs-self-describing-metadata.rst   |    1 
> > >  2 files changed, 580 insertions(+)
> > > 
> > > 
> > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > index 42e82971e036..f45bf97fa9c4 100644
> > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > @@ -864,3 +864,582 @@ Proposed patchsets include
> > >  and
> > >  `preservation of sickness info during memory reclaim
> > >  <
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=indirect-health-reporting>`_.
> > > +
> > > +5. Kernel Algorithms and Data Structures
> > > +========================================
> > > +
> > > +This section discusses the key algorithms and data structures of
> > > the
> > > kernel
> > > +code that provide the ability to check and repair metadata while
> > > the
> > > system
> > > +is running.
> > > +The first chapters in this section reveal the pieces that
> > > provide
> > > the
> > > +foundation for checking metadata.
> > > +The remainder of this section presents the mechanisms through
> > > which
> > > XFS
> > > +regenerates itself.
> > > +
> > > +Self Describing Metadata
> > > +------------------------
> > > +
> > > +Starting with XFS version 5 in 2012, XFS updated the format of
> > > nearly every
> > > +ondisk block header to record a magic number, a checksum, a
> > > universally
> > > +"unique" identifier (UUID), an owner code, the ondisk address of
> > > the
> > > block,
> > > +and a log sequence number.
> > > +When loading a block buffer from disk, the magic number, UUID,
> > > owner, and
> > > +ondisk address confirm that the retrieved block matches the
> > > specific
> > > owner of
> > > +the current filesystem, and that the information contained in
> > > the
> > > block is
> > > +supposed to be found at the ondisk address.
> > > +The first three components enable checking tools to disregard
> > > alleged metadata
> > > +that doesn't belong to the filesystem, and the fourth component
> > > enables the
> > > +filesystem to detect lost writes.
> > Add...
> > 
> > "When ever a file system operation modifies a block, the change is
> > submitted to the journal as a transaction.  The journal then
> > processes
> > these transactions marking them done once they are safely committed
> > to
> > the disk"
> 
> Ok, I'll add that transition.  Though I'll s/journal/log/ since this
> is
> xfs. :)
> 
> > At this point we havnt talked much at all about transactions or
> > logs,
> > and we've just barely begin to cover blocks.  I think you at least
> > want
> > a quick blip to describe the relation of these two things, or it
> > may
> > not be clear why we suddenly jumped into logs.
> 
> Point taken.  Thanks for the suggestion.
> 
> > > +
> > > +The logging code maintains the checksum and the log sequence
> > > number
> > > of the last
> > > +transactional update.
> > > +Checksums are useful for detecting torn writes and other
> > > mischief
> > "Checksums (or crc's) are useful for detecting incomplete or torn
> > writes as well as other discrepancies..."
> 
> Checksums are a general concept, whereas CRCs denote a particular
> family
> of checksums.  The statement would still apply even if we used a
> different family (e.g. erasure codes, cryptographic hash functions)
> of
> function instead of crc32c.
> 
> I will, however, avoid the undefined term 'mischief'.  Thanks for the
> correction.
> 
> "Checksums are useful for detecting torn writes and other
> discrepancies
> that can be introduced between the computer and its storage devices."
> 
> > > between the
> > > +computer and its storage devices.
> > > +Sequence number tracking enables log recovery to avoid applying
> > > out
> > > of date
> > > +log updates to the filesystem.
> > > +
> > > +These two features improve overall runtime resiliency by
> > > providing a
> > > means for
> > > +the filesystem to detect obvious corruption when reading
> > > metadata
> > > blocks from
> > > +disk, but these buffer verifiers cannot provide any consistency
> > > checking
> > > +between metadata structures.
> > > +
> > > +For more information, please see the documentation for
> > > +Documentation/filesystems/xfs-self-describing-metadata.rst
> > > +
> > > +Reverse Mapping
> > > +---------------
> > > +
> > > +The original design of XFS (circa 1993) is an improvement upon
> > > 1980s
> > > Unix
> > > +filesystem design.
> > > +In those days, storage density was expensive, CPU time was
> > > scarce,
> > > and
> > > +excessive seek time could kill performance.
> > > +For performance reasons, filesystem authors were reluctant to
> > > add
> > > redundancy to
> > > +the filesystem, even at the cost of data integrity.
> > > +Filesystems designers in the early 21st century choose different
> > > strategies to
> > > +increase internal redundancy -- either storing nearly identical
> > > copies of
> > > +metadata, or more space-efficient techniques such as erasure
> > > coding.
> > "such as erasure coding which may encode sections of the data with
> > redundant symbols and in more than one location"
> > 
> > That ties it into the next line.  If you go on to talk about a term
> > you
> > have not previously defined, i think you want to either define it
> > quickly or just drop it all together.  Right now your goal is to
> > just
> > give the reader context, so you want it to move quickly.
> 
> How about I shorten it to:
> 
> "...or more space-efficient encoding techniques." ?
Sure, I think that would be fine

> 
> and end the paragraph there?
> 
> > > +Obvious corruptions are typically repaired by copying replicas
> > > or
> > > +reconstructing from codes.
> > > +
> > I think I would have just jumped straight from xfs history to
> > modern
> > xfs...
> > > +For XFS, a different redundancy strategy was chosen to modernize
> > > the
> > > design:
> > > +a secondary space usage index that maps allocated disk extents
> > > back
> > > to their
> > > +owners.
> > > +By adding a new index, the filesystem retains most of its
> > > ability to
> > > scale
> > > +well to heavily threaded workloads involving large datasets,
> > > since
> > > the primary
> > > +file metadata (the directory tree, the file block map, and the
> > > allocation
> > > +groups) remain unchanged.
> > > 
> > 
> > > +Although the reverse-mapping feature increases overhead costs
> > > for
> > > space
> > > +mapping activities just like any other system that improves
> > > redundancy, it
> > "Like any system that improves redundancy, the reverse-mapping
> > feature
> > increases overhead costs for space mapping activities. However,
> > it..."
> 
> I like this better.  These two sentences have been changed to read:
> 
> "Like any system that improves redundancy, the reverse-mapping
> feature
> increases overhead costs for space mapping activities.  However, it
> has
> two critical advantages: first, the reverse index is key to enabling
> online fsck and other requested functionality such as free space
> defragmentation, better media failure reporting, and filesystem
> shrinking."
Alrighty, sounds good

> 
> > > +has two critical advantages: first, the reverse index is key to
> > > enabling online
> > > +fsck and other requested functionality such as filesystem
> > > reorganization,
> > > +better media failure reporting, and shrinking.
> > > +Second, the different ondisk storage format of the reverse
> > > mapping
> > > btree
> > > +defeats device-level deduplication, because the filesystem
> > > requires
> > > real
> > > +redundancy.
> > > +
> > > +A criticism of adding the secondary index is that it does
> > > nothing to
> > > improve
> > > +the robustness of user data storage itself.
> > > +This is a valid point, but adding a new index for file data
> > > block
> > > checksums
> > > +increases write amplification and turns data overwrites into
> > > copy-
> > > writes, which
> > > +age the filesystem prematurely.
> > > +In keeping with thirty years of precedent, users who want file
> > > data
> > > integrity
> > > +can supply as powerful a solution as they require.
> > > +As for metadata, the complexity of adding a new secondary index
> > > of
> > > space usage
> > > +is much less than adding volume management and storage device
> > > mirroring to XFS
> > > +itself.
> > > +Perfection of RAID and volume management are best left to
> > > existing
> > > layers in
> > > +the kernel.
> > I think I would cull the entire above paragraph.  rmap, crc and
> > raid
> > all have very different points of redundancy, so criticism that an
> > apple is not an orange or visavis just feels like a shortsighted
> > comparison that's probably more of a distraction than anything.
> > 
> > Sometimes it feels like this document kinda gets off into tangents
> > like it's preemptively trying to position it's self for an argument
> > that hasn't happened yet.
> 
> It does!  Each of the many tangents that you've pointed out are a
> reaction to some discussion that we've had on the list, or at an
> LSF, or <cough> fs nerds sniping on social media.  The reason I
> capture all of these offtopic arguments is to discourage people from
> wasting time rehashing discussions that were settled long ago.
> 
> Admittedly, that is a very defensive reaction on my part...
> 
> > But I think it has the effect of pulling the
> > readers attention off topic into an argument they never thought to
> > consider in the first place.  The topic of this section is to
> > explain
> > what rmap is.  So lets stay on topic and finish laying out that
> > ground
> > work first before getting into how it compares to other solutions
> 
> ...and you're right to point out that mentioning these things is
> distracting and provides fuel to reignite a flamewar.  At the same
> time,
> I think there's value in identifying the roads not taken, and why.
> 
> What if I turned these tangents into explicitly labelled sidebars?
> Would that help readers who want to stick to the topic?
> 
Sure, I think that would be a reasonable compromise

> > > +
> > > +The information captured in a reverse space mapping record is as
> > > follows:
> > > +
> > > +.. code-block:: c
> > > +
> > > +       struct xfs_rmap_irec {
> > > +           xfs_agblock_t    rm_startblock;   /* extent start
> > > block
> > > */
> > > +           xfs_extlen_t     rm_blockcount;   /* extent length */
> > > +           uint64_t         rm_owner;        /* extent owner */
> > > +           uint64_t         rm_offset;       /* offset within
> > > the
> > > owner */
> > > +           unsigned int     rm_flags;        /* state flags */
> > > +       };
> > > +
> > > +The first two fields capture the location and size of the
> > > physical
> > > space,
> > > +in units of filesystem blocks.
> > > +The owner field tells scrub which metadata structure or file
> > > inode
> > > have been
> > > +assigned this space.
> > > +For space allocated to files, the offset field tells scrub where
> > > the
> > > space was
> > > +mapped within the file fork.
> > > +Finally, the flags field provides extra information about the
> > > space
> > > usage --
> > > +is this an attribute fork extent?  A file mapping btree extent? 
> > > Or
> > > an
> > > +unwritten data extent?
> > > +
> > > +Online filesystem checking judges the consistency of each
> > > primary
> > > metadata
> > > +record by comparing its information against all other space
> > > indices.
> > > +The reverse mapping index plays a key role in the consistency
> > > checking process
> > > +because it contains a centralized alternate copy of all space
> > > allocation
> > > +information.
> > > +Program runtime and ease of resource acquisition are the only
> > > real
> > > limits to
> > > +what online checking can consult.
> > > +For example, a file data extent mapping can be checked against:
> > > +
> > > +* The absence of an entry in the free space information.
> > > +* The absence of an entry in the inode index.
> > > +* The absence of an entry in the reference count data if the
> > > file is
> > > not
> > > +  marked as having shared extents.
> > > +* The correspondence of an entry in the reverse mapping
> > > information.
> > > +
> > > +A key observation here is that only the reverse mapping can
> > > provide
> > > a positive
> > > +affirmation of correctness if the primary metadata is in doubt.
> > if any of the above metadata is in doubt...
> 
> Fixed.
> 
> > > +The checking code for most primary metadata follows a path
> > > similar
> > > to the
> > > +one outlined above.
> > > +
> > > +A second observation to make about this secondary index is that
> > > proving its
> > > +consistency with the primary metadata is difficult.
> > 
> > > +Demonstrating that a given reverse mapping record exactly
> > > corresponds to the
> > > +primary space metadata involves a full scan of all primary space
> > > metadata,
> > > +which is very time intensive.
> > "But why?" Wonders the reader. Just jump into an example:
> > 
> > "In order to verify that an rmap extent does not incorrectly over
> > lap
> > with another record, we would need a full scan of all the other
> > records, which is time intensive."
> 
> I want to shorten it even further:
> 
> "Validating that reverse mapping records are correct requires a full
> scan of all primary space metadata, which is very time intensive."
Ok, I think that sounds fine

> 
> > 
> > ?
> > 
> > And then the below is a separate observation right?  
> 
> Right.
> 
> > > +Scanning activity for online fsck can only use non-blocking lock
> > > acquisition
> > > +primitives if the locking order is not the regular order as used
> > > by
> > > the rest of
> > > +the filesystem.
> > Lastly, it should be noted that most file system operations tend to
> > lock primary metadata before locking the secondary metadata.
> 
> This isn't accurate -- metadata structures don't have separate locks.
> So it's not true to say that we lock primary or secondary metadata.
> 
> We /can/ say that file operations lock the inode, then the AGI, then
> the
> AGF; or that directory operations lock the parent and child ILOCKs in
> inumber order; and that if scrub wants to take locks in any other
> order,
> it can only do that via trylocks and backoff.
I see, ok maybe giving one or both of those examples is clearer then

> 
> > This
> > means that scanning operations that acquire the secondary metadata
> > first may need to yield the secondary lock to filesystem operations
> > that have already acquired the primary lock. 
> > 
> > ?
> > 
> > > +This means that forward progress during this part of a scan of
> > > the
> > > reverse
> > > +mapping data cannot be guaranteed if system load is especially
> > > heavy.
> > > +Therefore, it is not practical for online check to detect
> > > reverse
> > > mapping
> > > +records that lack a counterpart in the primary metadata.
> > Such as <quick list / quick example>
> > 
> > > +Instead, scrub relies on rigorous cross-referencing during the
> > > primary space
> > > +mapping structure checks.
> 
> I've converted this section into a bullet list:
> 
> "There are several observations to make about reverse mapping
> indices:
> 
> "1. Reverse mappings can provide a positive affirmation of
> correctness if
> any of the above primary metadata are in doubt.  The checking code
> for
> most primary metadata follows a path similar to the one outlined
> above.
> 
> "2. Proving the consistency of secondary metadata with the primary
> metadata is difficult because that requires a full scan of all
> primary
> space metadata, which is very time intensive.  For example, checking
> a
> reverse mapping record for a file extent mapping btree block requires
> locking the file and searching the entire btree to confirm the block.
> Instead, scrub relies on rigorous cross-referencing during the
> primary
> space mapping structure checks.
> 
> "3. Consistency scans must use non-blocking lock acquisition
> primitives
> if the required locking order is not the same order used by regular
> filesystem operations.  This means that forward progress during this
> part of a scan of the reverse mapping data cannot be guaranteed if
> system load is heavy."
Ok, I think that reads cleaner

> 
> > > +
> > 
> > The below paragraph sounds like a re-cap?
> > 
> > "So to recap, reverse mappings also...."
> 
> Yep.
> 
> > > +Reverse mappings also play a key role in reconstruction of
> > > primary
> > > metadata.
> > > +The secondary information is general enough for online repair to
> > > synthesize a
> > > +complete copy of any primary space management metadata by
> > > locking
> > > that
> > > +resource, querying all reverse mapping indices looking for
> > > records
> > > matching
> > > +the relevant resource, and transforming the mapping into an
> > > appropriate format.
> > > +The details of how these records are staged, written to disk,
> > > and
> > > committed
> > > +into the filesystem are covered in subsequent sections.
> > I also think the section would be ok if you were to trim off this
> > last
> > paragraph too.
> 
> Hm.  I still want to set up the expectation that there's more to
> come.
> How about a brief two-sentence transition paragraph:
> 
> "In summary, reverse mappings play a key role in reconstruction of
> primary metadata.  The details of how these records are staged,
> written
> to disk, and committed into the filesystem are covered in subsequent
> sections."
Ok, I think that's a cleaner wrap up
> 
diff mbox series

Patch

diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index 42e82971e036..f45bf97fa9c4 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -864,3 +864,582 @@  Proposed patchsets include
 and
 `preservation of sickness info during memory reclaim
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting>`_.
+
+5. Kernel Algorithms and Data Structures
+========================================
+
+This section discusses the key algorithms and data structures of the kernel
+code that provide the ability to check and repair metadata while the system
+is running.
+The first chapters in this section reveal the pieces that provide the
+foundation for checking metadata.
+The remainder of this section presents the mechanisms through which XFS
+regenerates itself.
+
+Self Describing Metadata
+------------------------
+
+Starting with XFS version 5 in 2012, XFS updated the format of nearly every
+ondisk block header to record a magic number, a checksum, a universally
+"unique" identifier (UUID), an owner code, the ondisk address of the block,
+and a log sequence number.
+When loading a block buffer from disk, the magic number, UUID, owner, and
+ondisk address confirm that the retrieved block matches the specific owner of
+the current filesystem, and that the information contained in the block is
+supposed to be found at the ondisk address.
+The first three components enable checking tools to disregard alleged metadata
+that doesn't belong to the filesystem, and the fourth component enables the
+filesystem to detect lost writes.
+
+The logging code maintains the checksum and the log sequence number of the last
+transactional update.
+Checksums are useful for detecting torn writes and other mischief between the
+computer and its storage devices.
+Sequence number tracking enables log recovery to avoid applying out of date
+log updates to the filesystem.
+
+These two features improve overall runtime resiliency by providing a means for
+the filesystem to detect obvious corruption when reading metadata blocks from
+disk, but these buffer verifiers cannot provide any consistency checking
+between metadata structures.
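+
+The fields above add up to a block header along the following lines.
+This layout is purely illustrative -- real XFS ondisk structures such as
+``struct xfs_btree_block`` arrange these fields differently:
+
+.. code-block:: c
+
+	/* Illustrative self-describing block header; not an actual ondisk format. */
+	struct example_block_header {
+	    uint32_t    magic;      /* structure type discriminator */
+	    uint32_t    crc;        /* checksum of the entire block */
+	    uint8_t     uuid[16];   /* filesystem that owns this block */
+	    uint64_t    owner;      /* AG or inode to which this block belongs */
+	    uint64_t    blkno;      /* ondisk address of this block */
+	    uint64_t    lsn;        /* log sequence number of the last write */
+	};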
+
+For more information, please see
+Documentation/filesystems/xfs-self-describing-metadata.rst.
+
+Reverse Mapping
+---------------
+
+The original design of XFS (circa 1993) is an improvement upon 1980s Unix
+filesystem design.
+In those days, storage density was expensive, CPU time was scarce, and
+excessive seek time could kill performance.
+For performance reasons, filesystem authors were reluctant to add redundancy to
+the filesystem, even at the cost of data integrity.
+Filesystem designers in the early 21st century chose different strategies to
+increase internal redundancy -- either storing nearly identical copies of
+metadata, or more space-efficient techniques such as erasure coding.
+Obvious corruptions are typically repaired by copying replicas or
+reconstructing from codes.
+
+For XFS, a different redundancy strategy was chosen to modernize the design:
+a secondary space usage index that maps allocated disk extents back to their
+owners.
+By adding a new index, the filesystem retains most of its ability to scale
+well to heavily threaded workloads involving large datasets, since the primary
+file metadata (the directory tree, the file block map, and the allocation
+groups) remain unchanged.
+Although the reverse-mapping feature increases overhead costs for space
+mapping activities just like any other system that improves redundancy, it
+has two critical advantages: first, the reverse index is key to enabling online
+fsck and other requested functionality such as filesystem reorganization,
+better media failure reporting, and shrinking.
+Second, the different ondisk storage format of the reverse mapping btree
+defeats device-level deduplication, because the filesystem requires real
+redundancy.
+
+A criticism of adding the secondary index is that it does nothing to improve
+the robustness of user data storage itself.
+This is a valid point, but adding a new index for file data block checksums
+increases write amplification and turns data overwrites into copy-writes, which
+age the filesystem prematurely.
+In keeping with thirty years of precedent, users who want file data integrity
+can supply as powerful a solution as they require.
+As for metadata, the complexity of adding a new secondary index of space usage
+is much less than adding volume management and storage device mirroring to XFS
+itself.
+Perfection of RAID and volume management is best left to existing layers in
+the kernel.
+
+The information captured in a reverse space mapping record is as follows:
+
+.. code-block:: c
+
+	struct xfs_rmap_irec {
+	    xfs_agblock_t    rm_startblock;   /* extent start block */
+	    xfs_extlen_t     rm_blockcount;   /* extent length */
+	    uint64_t         rm_owner;        /* extent owner */
+	    uint64_t         rm_offset;       /* offset within the owner */
+	    unsigned int     rm_flags;        /* state flags */
+	};
+
+The first two fields capture the location and size of the physical space,
+in units of filesystem blocks.
+The owner field tells scrub which metadata structure or file inode has been
+assigned this space.
+For space allocated to files, the offset field tells scrub where the space was
+mapped within the file fork.
+Finally, the flags field provides extra information about the space usage --
+is this an attribute fork extent?  A file mapping btree extent?  Or an
+unwritten data extent?
+
+Online filesystem checking judges the consistency of each primary metadata
+record by comparing its information against all other space indices.
+The reverse mapping index plays a key role in the consistency checking process
+because it contains a centralized alternate copy of all space allocation
+information.
+Program runtime and ease of resource acquisition are the only real limits to
+what online checking can consult.
+For example, a file data extent mapping can be checked against:
+
+* The absence of an entry in the free space information.
+* The absence of an entry in the inode index.
+* The absence of an entry in the reference count data if the file is not
+  marked as having shared extents.
+* The correspondence of an entry in the reverse mapping information.
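+
+A sketch of this cross-referencing logic, with hypothetical query helpers
+standing in for the real btree lookups, might look like this:
+
+.. code-block:: c
+
+	/* Hypothetical helpers standing in for the real btree queries. */
+	bool xref_is_free_space(uint64_t bno, uint64_t len);
+	bool xref_is_inode_chunk(uint64_t bno, uint64_t len);
+	bool xref_is_shared(uint64_t bno, uint64_t len);
+	bool xref_rmap_confirms(uint64_t bno, uint64_t len, uint64_t owner);
+
+	/* Cross-reference one file data extent against the other indices. */
+	bool
+	xref_file_extent(uint64_t bno, uint64_t len, uint64_t owner,
+	                 bool file_is_shared)
+	{
+	    if (xref_is_free_space(bno, len))
+	        return false;    /* allocated space must not be free */
+	    if (xref_is_inode_chunk(bno, len))
+	        return false;    /* file data must not overlap inode chunks */
+	    if (!file_is_shared && xref_is_shared(bno, len))
+	        return false;    /* unshared file appears in refcount data */
+	    /* Only the reverse mapping provides positive confirmation. */
+	    return xref_rmap_confirms(bno, len, owner);
+	}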
+
+A key observation here is that only the reverse mapping can provide a positive
+affirmation of correctness if the primary metadata is in doubt.
+The checking code for most primary metadata follows a path similar to the
+one outlined above.
+
+A second observation to make about this secondary index is that proving its
+consistency with the primary metadata is difficult.
+Demonstrating that a given reverse mapping record exactly corresponds to the
+primary space metadata involves a full scan of all primary space metadata,
+which is very time intensive.
+Scanning activity for online fsck can only use non-blocking lock acquisition
+primitives if the locking order is not the regular order as used by the rest of
+the filesystem.
+This means that forward progress during this part of a scan of the reverse
+mapping data cannot be guaranteed if system load is especially heavy.
+Therefore, it is not practical for online checking to detect reverse mapping
+records that lack a counterpart in the primary metadata.
+Instead, scrub relies on rigorous cross-referencing during the primary space
+mapping structure checks.
+
+Reverse mappings also play a key role in reconstruction of primary metadata.
+The secondary information is general enough for online repair to synthesize a
+complete copy of any primary space management metadata by locking that
+resource, querying all reverse mapping indices looking for records matching
+the relevant resource, and transforming the mapping into an appropriate format.
+The details of how these records are staged, written to disk, and committed
+into the filesystem are covered in subsequent sections.
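+
+A sketch of that regeneration loop, with hypothetical query and staging
+helpers standing in for the real machinery, might be:
+
+.. code-block:: c
+
+	/* Hypothetical iteration and staging helpers. */
+	struct rmap_cursor { uint64_t next; };
+	void rmap_query_start(struct rmap_cursor *cur);
+	bool rmap_query_next(struct rmap_cursor *cur, struct xfs_rmap_irec *rec);
+	void stage_new_record(uint32_t bno, uint32_t len, uint64_t offset,
+	                      unsigned int flags);
+	int commit_staged_index(void);
+
+	/* Rebuild a primary space index from the reverse mapping records. */
+	int
+	rebuild_primary_index(uint64_t resource_owner)
+	{
+	    struct xfs_rmap_irec    rec;
+	    struct rmap_cursor      cur;
+
+	    rmap_query_start(&cur);
+	    while (rmap_query_next(&cur, &rec)) {
+	        if (rec.rm_owner != resource_owner)
+	            continue;
+	        /* transform each mapping into the new index's record format */
+	        stage_new_record(rec.rm_startblock, rec.rm_blockcount,
+	                         rec.rm_offset, rec.rm_flags);
+	    }
+	    return commit_staged_index();
+	}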
+
+Checking and Cross-Referencing
+------------------------------
+
+The first step of checking a metadata structure is to examine every record
+contained within the structure and its relationship with the rest of the
+system.
+XFS contains multiple layers of checking to try to prevent inconsistent
+metadata from wreaking havoc on the system.
+Each of these layers contributes information that helps the kernel to make
+five decisions about the health of a metadata structure:
+
+- Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``) ?
+- Is this structure inconsistent with the rest of the system
+  (``XFS_SCRUB_OFLAG_XCORRUPT``) ?
+- Is there so much damage around the filesystem that cross-referencing is not
+  possible (``XFS_SCRUB_OFLAG_XFAIL``) ?
+- Can the structure be optimized to improve performance or reduce the size of
+  metadata (``XFS_SCRUB_OFLAG_PREEN``) ?
+- Does the structure contain data that is not inconsistent but deserves review
+  by the system administrator (``XFS_SCRUB_OFLAG_WARNING``) ?
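+
+These decisions surface to userspace as output flags in ``sm_flags`` of the
+scrub ioctl's ``struct xfs_scrub_metadata``; the flag names below are real,
+but the reporting helper and its inputs are a hypothetical sketch:
+
+.. code-block:: c
+
+	/* Hypothetical summary of what a check discovered. */
+	struct check_state {
+	    bool    records_corrupt;
+	    bool    xref_disagrees;
+	    bool    xref_unavailable;
+	    bool    could_be_optimized;
+	    bool    needs_admin_review;
+	};
+
+	/* Map check outcomes to the ioctl output flags from xfs_fs.h. */
+	void
+	scrub_report(struct xfs_scrub_metadata *sm,
+	             const struct check_state *cs)
+	{
+	    if (cs->records_corrupt)
+	        sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
+	    if (cs->xref_disagrees)
+	        sm->sm_flags |= XFS_SCRUB_OFLAG_XCORRUPT;
+	    if (cs->xref_unavailable)
+	        sm->sm_flags |= XFS_SCRUB_OFLAG_XFAIL;
+	    if (cs->could_be_optimized)
+	        sm->sm_flags |= XFS_SCRUB_OFLAG_PREEN;
+	    if (cs->needs_admin_review)
+	        sm->sm_flags |= XFS_SCRUB_OFLAG_WARNING;
+	}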
+
+The following sections describe how the metadata scrubbing process works.
+
+Metadata Buffer Verification
+````````````````````````````
+
+The lowest layer of metadata protection in XFS consists of the metadata
+verifiers built into the buffer cache.
+These functions perform inexpensive internal consistency checking of the block
+itself, and answer these questions:
+
+- Does the block belong to this filesystem?
+
+- Does the block belong to the structure that asked for the read?
+  This assumes that metadata blocks only have one owner, which is always true
+  in XFS.
+
+- Is the type of data stored in the block within a reasonable range of what
+  scrub is expecting?
+
+- Does the physical location of the block match the location it was read from?
+
+- Does the block checksum match the data?
+
+The scope of the protections here is very limited -- verifiers can only
+establish that the filesystem code is reasonably free of gross corruption bugs
+and that the storage system is reasonably competent at retrieval.
+Corruption problems observed at runtime cause the generation of health reports,
+failed system calls, and in the extreme case, filesystem shutdowns if the
+corrupt metadata forces the cancellation of a dirty transaction.
+
+Every online fsck scrubbing function is expected to read every ondisk metadata
+block of a structure in the course of checking the structure.
+Corruption problems observed during a check are immediately reported to
+userspace as corruption; during a cross-reference, they are reported as a
+failure to cross-reference once the full examination is complete.
+Reads satisfied by a buffer already in cache (and hence already verified)
+bypass these checks.
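+
+The shape of such a verifier can be sketched as follows, reusing the
+illustrative header from earlier; the magic value and checksum helper are
+stand-ins for the real ones:
+
+.. code-block:: c
+
+	#define EXAMPLE_MAGIC    0x1DEA0001    /* arbitrary sketch value */
+
+	uint32_t example_checksum(const void *block, size_t blocksize);
+
+	/* Hypothetical read verifier answering the questions above. */
+	bool
+	example_read_verify(const struct example_block_header *hdr,
+	                    const void *block, size_t blocksize,
+	                    const uint8_t *fs_uuid, uint64_t daddr)
+	{
+	    if (hdr->magic != EXAMPLE_MAGIC)
+	        return false;    /* not the structure we asked for */
+	    if (memcmp(hdr->uuid, fs_uuid, 16) != 0)
+	        return false;    /* block from some other filesystem */
+	    if (hdr->blkno != daddr)
+	        return false;    /* misdirected or misplaced write */
+	    if (hdr->crc != example_checksum(block, blocksize))
+	        return false;    /* torn write or media corruption */
+	    return true;
+	}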
+
+Internal Consistency Checks
+```````````````````````````
+
+The next higher level of metadata protection is the internal record
+verification code built into the filesystem.
+These checks are split between the buffer verifiers, the in-filesystem users of
+the buffer cache, and the scrub code itself, depending on the amount of higher
+level context required.
+The scope of checking is still internal to the block.
+For performance reasons, regular code may skip some of these checks unless
+debugging is enabled or a write is about to occur.
+Scrub functions, of course, must check all possible problems.
+Either way, these higher level checking functions answer these questions:
+
+- Does the type of data stored in the block match what scrub is expecting?
+
+- Does the block belong to the owning structure that asked for the read?
+
+- If the block contains records, do the records fit within the block?
+
+- If the block tracks internal free space information, is it consistent with
+  the record areas?
+
+- Are the records contained inside the block free of obvious corruptions?
+
+Record checks in this category are more rigorous and more time-intensive.
+For example, block pointers and inumbers are checked to ensure that they point
+within the dynamically allocated parts of an allocation group and within
+the filesystem.
+Names are checked for invalid characters, and flags are checked for invalid
+combinations.
+Other record attributes are checked for sensible values.
+Btree records spanning an interval of the btree keyspace are checked for
+correct order and lack of mergeability (except for file fork mappings).
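+
+For instance, a block pointer bounds check might look like this sketch; XFS
+has real equivalents such as ``xfs_verify_agbno()``:
+
+.. code-block:: c
+
+	/*
+	 * Hypothetical bounds check: an AG block pointer must point past the
+	 * static AG headers and stay within the AG.
+	 */
+	bool
+	example_verify_agbno(uint32_t agbno, uint32_t first_usable_agbno,
+	                     uint32_t ag_block_count)
+	{
+	    return agbno >= first_usable_agbno && agbno < ag_block_count;
+	}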
+
+Validation of Userspace-Controlled Record Attributes
+````````````````````````````````````````````````````
+
+Various pieces of filesystem metadata are directly controlled by userspace.
+Because of this, validation work cannot be more precise than checking
+that a value is within the possible range.
+These fields include:
+
+- Superblock fields controlled by mount options
+- Filesystem labels
+- File timestamps
+- File permissions
+- File size
+- File flags
+- Names present in directory entries, extended attribute keys, and filesystem
+  labels
+- Extended attribute key namespaces
+- Extended attribute values
+- File data block contents
+- Quota limits
+- Quota timer expiration (if resource usage exceeds the soft limit)
+
+Cross-Referencing Space Metadata
+````````````````````````````````
+
+The next higher level of checking is cross-referencing records between metadata
+structures.
+For regular runtime code, the cost of these checks is considered to be
+prohibitively expensive, but as scrub is dedicated to rooting out
+inconsistencies, it must pursue all avenues of inquiry.
+The exact set of cross-referencing is highly dependent on the context of the
+data structure being checked.
+
+The XFS btree code has keyspace scanning functions that online fsck uses to
+cross-reference one structure with another.
+Specifically, scrub can scan the keyspace of an index to determine if that
+keyspace is fully, sparsely, or not at all mapped to records.
+For the reverse mapping btree, it is possible to mask parts of the key for the
+purposes of performing a keyspace scan so that scrub can decide if the rmap
+btree contains records mapping a certain extent of physical space without the
+sparseness of the rest of the rmap keyspace getting in the way.
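+
+Illustratively, the outcome of such a keyspace scan can be summarized with a
+small enumeration; the names here are a sketch, not the kernel's:
+
+.. code-block:: c
+
+	/* Possible outcomes of scanning a btree keyspace (illustrative). */
+	enum keyspace_mapping {
+	    KEYSPACE_EMPTY,     /* no records map into the keyspace */
+	    KEYSPACE_SPARSE,    /* part of the keyspace maps to records */
+	    KEYSPACE_FULL,      /* records cover the entire keyspace */
+	};
+
+	/* Hypothetical query: how do existing records cover [low, high]? */
+	enum keyspace_mapping keyspace_scan(const struct btree *bt,
+	                                    const void *low_key,
+	                                    const void *high_key);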
+
+Btree blocks undergo the following checks before cross-referencing:
+
+- Does the type of data stored in the block match what scrub is expecting?
+
+- Does the block belong to the owning structure that asked for the read?
+
+- Do the records fit within the block?
+
+- Are the records contained inside the block free of obvious corruptions?
+
+- Are the name hashes in the correct order?
+
+- Do node pointers within the btree point to valid block addresses for the type
+  of btree?
+
+- Do child pointers point towards the leaves?
+
+- Do sibling pointers point across the same level?
+
+- For each node block record, does the record key accurately reflect the
+  contents of the child block?
+
+Space allocation records are cross-referenced as follows:
+
+1. Any space mentioned by any metadata structure is cross-referenced as
+   follows:
+
+   - Does the reverse mapping index list only the appropriate owner as the
+     owner of each block?
+
+   - Are none of the blocks claimed as free space?
+
+   - If these aren't file data blocks, are none of the blocks claimed as space
+     shared by different owners?
+
+2. Btree blocks are cross-referenced as follows:
+
+   - Everything in class 1 above.
+
+   - If there's a parent node block, do the keys listed for this block match the
+     keyspace of this block?
+
+   - Do the sibling pointers point to valid blocks?  Of the same level?
+
+   - Do the child pointers point to valid blocks?  Of the next level down?
+
+3. Free space btree records are cross-referenced as follows:
+
+   - Everything in class 1 and 2 above.
+
+   - Does the reverse mapping index list no owners of this space?
+
+   - Is this space not claimed by the inode index for inodes?
+
+   - Is it not mentioned by the reference count index?
+
+   - Is there a matching record in the other free space btree?
+
+4. Inode btree records are cross-referenced as follows:
+
+   - Everything in class 1 and 2 above.
+
+   - Is there a matching record in the free inode btree?
+
+   - Do cleared bits in the holemask correspond with inode clusters?
+
+   - Do set bits in the freemask correspond with inode records with zero link
+     count?
+
+5. Inode records are cross-referenced as follows:
+
+   - Everything in class 1.
+
+   - Do all the fields that summarize information about the file forks actually
+     match those forks?
+
+   - Does each inode with zero link count correspond to a record in the free
+     inode btree?
+
+6. File fork space mapping records are cross-referenced as follows:
+
+   - Everything in class 1 and 2 above.
+
+   - Is this space not mentioned by the inode btrees?
+
+   - If this is a CoW fork mapping, does it correspond to a CoW entry in the
+     reference count btree?
+
+7. Reference count records are cross-referenced as follows:
+
+   - Everything in class 1 and 2 above.
+
+   - Within the space subkeyspace of the rmap btree (that is to say, all
+     records mapped to a particular space extent and ignoring the owner info),
+     are there the same number of reverse mapping records for each block as the
+     reference count record claims?
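+
+The last of those checks might be sketched as follows, with a hypothetical
+helper that counts the reverse mappings covering a block while ignoring the
+owner information:
+
+.. code-block:: c
+
+	/* Hypothetical: count rmap records covering bno, owners ignored. */
+	uint32_t count_rmaps_at(uint32_t bno);
+
+	/* Check one reference count record against the rmap data. */
+	bool
+	xref_refcount_record(uint32_t rc_startblock, uint32_t rc_blockcount,
+	                     uint32_t rc_refcount)
+	{
+	    uint32_t    bno;
+
+	    for (bno = rc_startblock;
+	         bno < rc_startblock + rc_blockcount;
+	         bno++) {
+	        if (count_rmaps_at(bno) != rc_refcount)
+	            return false;
+	    }
+	    return true;
+	}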
+
+Proposed patchsets are the series to find gaps in
+`refcount btree
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-gaps>`_,
+`inode btree
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gaps>`_, and
+`rmap btree
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-gaps>`_ records;
+to find
+`mergeable records
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-records>`_;
+and to
+`improve cross referencing with rmap
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-checking>`_
+before starting a repair.
+
+Checking Extended Attributes
+````````````````````````````
+
+Extended attributes implement a key-value store that enables fragments of data
+to be attached to any file.
+Both the kernel and userspace can access the keys and values, subject to
+namespace and privilege restrictions.
+Most typically these fragments are metadata about the file -- origins, security
+contexts, user-supplied labels, indexing information, etc.
+
+Names can be as long as 255 bytes and can exist in several different
+namespaces.
+Values can be as large as 64KB.
+A file's extended attributes are stored in blocks mapped by the attr fork.
+The mappings point to leaf blocks, remote value blocks, or dabtree blocks.
+Block 0 in the attribute fork is always the top of the structure, but otherwise
+each of the three types of blocks can be found at any offset in the attr fork.
+Leaf blocks contain attribute key records that point to the name and the value.
+Names are always stored elsewhere in the same leaf block.
+Values that are less than 3/4 the size of a filesystem block are also stored
+elsewhere in the same leaf block.
+Remote value blocks contain values that are too large to fit inside a leaf.
+If the leaf information exceeds a single filesystem block, a dabtree (also
+rooted at block 0) is created to map hashes of the attribute names to leaf
+blocks in the attr fork.
+
+Checking an extended attribute structure is not so straightforward due to the
+lack of separation between attr blocks and index blocks.
+Scrub must read each block mapped by the attr fork and ignore the non-leaf
+blocks:
+
+1. Walk the dabtree in the attr fork (if present) to ensure that there are no
+   irregularities in the blocks or dabtree mappings that do not point to
+   attr leaf blocks.
+
+2. Walk the blocks of the attr fork looking for leaf blocks.
+   For each entry inside a leaf:
+
+   a. Validate that the name does not contain invalid characters.
+
+   b. Read the attr value.
+      This performs a named lookup of the attr name to ensure the correctness
+      of the dabtree.
+      If the value is stored in a remote block, this also validates the
+      integrity of the remote value block.
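+
+The per-entry work in step 2 might be sketched like this, with hypothetical
+name-validation and lookup helpers:
+
+.. code-block:: c
+
+	/* Hypothetical helpers for attr name and value retrieval. */
+	bool attr_name_is_valid(const char *name, unsigned int namelen);
+	int attr_lookup(const char *name, unsigned int namelen,
+	                void *valuebuf, int *buflen);
+
+	/* Validate one attr leaf entry. */
+	bool
+	check_attr_entry(const char *name, unsigned int namelen,
+	                 void *valuebuf, int buflen)
+	{
+	    if (!attr_name_is_valid(name, namelen))
+	        return false;
+	    /*
+	     * Looking the name up again exercises the dabtree mapping, and
+	     * fetching the value checks any remote value blocks.
+	     */
+	    return attr_lookup(name, namelen, valuebuf, &buflen) == 0;
+	}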
+
+Checking and Cross-Referencing Directories
+``````````````````````````````````````````
+
+The filesystem directory tree is a directed acyclic graph structure, with files
+constituting the nodes, and directory entries (dirents) constituting the edges.
+Directories are a special type of file containing a set of mappings from a
+255-byte sequence (name) to an inumber.
+These are called directory entries, or dirents for short.
+Each directory file must have exactly one parent directory pointing to it.
+The root directory points to itself.
+Directory entries point to files of any type.
+Each non-directory file may have multiple directories pointing to it.
+
+In XFS, directories are implemented as a file containing up to three 32GB
+partitions.
+The first partition contains directory entry data blocks.
+Each data block contains variable-sized records associating a user-provided
+name with an inumber and, optionally, a file type.
+If the directory entry data grows beyond one block, the second partition (which
+exists as post-EOF extents) is populated with a block containing free space
+information and an index that maps hashes of the dirent names to directory data
+blocks in the first partition.
+This makes directory name lookups very fast.
+If this second partition grows beyond one block, the third partition is
+populated with a linear array of free space information for faster
+expansions.
+If the free space has been separated and the second partition grows again
+beyond one block, then a dabtree is used to map hashes of dirent names to
+directory data blocks.
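+
+In the ondisk format, the three partitions begin at fixed 32GB offsets within
+the directory file.
+The constants below are an illustrative sketch of that layout; see
+``xfs_da_format.h`` for the authoritative definitions:
+
+.. code-block:: c
+
+	/* Illustrative byte offsets of the three directory partitions. */
+	#define DIR_PARTITION_SIZE    (32ULL << 30)            /* 32 GiB */
+	#define DIR_DATA_OFFSET       (0 * DIR_PARTITION_SIZE) /* dirents */
+	#define DIR_LEAF_OFFSET       (1 * DIR_PARTITION_SIZE) /* hash index */
+	#define DIR_FREE_OFFSET       (2 * DIR_PARTITION_SIZE) /* free info */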
+
+Checking a directory is pretty straightforward:
+
+1. Walk the dabtree in the second partition (if present) to ensure that there
+   are no irregularities in the blocks or dabtree mappings that do not point to
+   dirent blocks.
+
+2. Walk the blocks of the first partition looking for directory entries.
+   Each dirent is checked as follows:
+
+   a. Does the name contain no invalid characters?
+
+   b. Does the inumber correspond to an actual, allocated inode?
+
+   c. Does the child inode have a nonzero link count?
+
+   d. If a file type is included in the dirent, does it match the type of the
+      inode?
+
+   e. If the child is a subdirectory, does the child's dotdot pointer point
+      back to the parent?
+
+   f. If the directory has a second partition, perform a named lookup of the
+      dirent name to ensure the correctness of the dabtree.
+
+3. Walk the free space list in the third partition (if present) to ensure that
+   the free spaces it describes are really unused.
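+
+The per-dirent checks in step 2 might be sketched as follows; the record
+layout and all of the helpers here are hypothetical:
+
+.. code-block:: c
+
+	/* Hypothetical flattened view of one directory entry. */
+	struct dirent_rec {
+	    const char    *name;
+	    uint8_t       namelen;
+	    uint64_t      inumber;     /* inode this entry points to */
+	    bool          has_ftype;
+	    uint8_t       ftype;
+	};
+
+	/* Hypothetical summary of the child inode. */
+	struct inode_info {
+	    uint32_t    nlink;
+	    uint8_t     ftype;
+	    bool        is_dir;
+	};
+
+	bool name_is_valid(const char *name, uint8_t namelen);
+	bool inode_lookup(uint64_t inumber, struct inode_info *ii);
+	uint64_t dotdot_of(const struct inode_info *ii);
+	uint64_t hashed_lookup(uint64_t dir, const char *name, uint8_t namelen);
+
+	/* Validate a single directory entry. */
+	bool
+	check_dirent(uint64_t dir_inumber, const struct dirent_rec *de)
+	{
+	    struct inode_info    child;
+
+	    if (!name_is_valid(de->name, de->namelen))
+	        return false;
+	    if (!inode_lookup(de->inumber, &child))
+	        return false;    /* must be an actual, allocated inode */
+	    if (child.nlink == 0)
+	        return false;    /* children must have a nonzero link count */
+	    if (de->has_ftype && de->ftype != child.ftype)
+	        return false;    /* dirent file type must match the inode */
+	    if (child.is_dir && dotdot_of(&child) != dir_inumber)
+	        return false;    /* subdirectory dotdot must point back */
+	    /* exercise the hash index, if the directory has one */
+	    return hashed_lookup(dir_inumber, de->name, de->namelen) ==
+	           de->inumber;
+	}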
+
+Checking operations involving :ref:`parents <dirparent>` and
+:ref:`file link counts <nlinks>` are discussed in more detail in later
+sections.
+
+Checking Directory/Attribute Btrees
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As stated in previous sections, the directory/attribute btree (dabtree) index
+maps user-provided names to improve lookup times by avoiding linear scans.
+Internally, it maps a 32-bit hash of the name to a block offset within the
+appropriate file fork.
+
+The internal structure of a dabtree closely resembles the btrees that record
+fixed-size metadata records -- each dabtree block contains a magic number, a
+checksum, sibling pointers, a UUID, a tree level, and a log sequence number.
+The format of leaf and node records is the same -- each entry points to the
+next level down in the hierarchy, with dabtree node records pointing to dabtree
+leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere
+in the fork.
+
+Checking and cross-referencing the dabtree is very similar to what is done for
+space btrees:
+
+- Does the type of data stored in the block match what scrub is expecting?
+
+- Does the block belong to the owning structure that asked for the read?
+
+- Do the records fit within the block?
+
+- Are the records contained inside the block free of obvious corruptions?
+
+- Are the name hashes in the correct order?
+
+- Do node pointers within the dabtree point to valid fork offsets for dabtree
+  blocks?
+
+- Do leaf pointers within the dabtree point to valid fork offsets for directory
+  or attr leaf blocks?
+
+- Do child pointers point towards the leaves?
+
+- Do sibling pointers point across the same level?
+
+- For each dabtree node record, does the record key accurately reflect the
+  contents of the child dabtree block?
+
+- For each dabtree leaf record, does the record key accurately reflect the
+  contents of the directory or attr block?
+
+Cross-Referencing Summary Counters
+``````````````````````````````````
+
+XFS maintains three classes of summary counters: available resources, quota
+resource usage, and file link counts.
+
+In theory, the amount of available resources (data blocks, inodes, realtime
+extents) can be found by walking the entire filesystem.
+This would make for very slow reporting, so a transactional filesystem can
+maintain summaries of this information in the superblock.
+Cross-referencing these values against the filesystem metadata should be a
+simple matter of walking the free space and inode metadata in each AG and the
+realtime bitmap, but there are complications that will be discussed in
+:ref:`more detail <fscounters>` later.
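+
+Setting aside those complications, the naive cross-check is a sum over all
+allocation groups (realtime extents omitted for brevity), as in this sketch
+with hypothetical counting helpers:
+
+.. code-block:: c
+
+	/* Hypothetical per-AG walks of the free space and inode metadata. */
+	uint64_t count_free_space(unsigned int agno);
+	uint64_t count_inodes(unsigned int agno);
+	uint64_t count_free_inodes(unsigned int agno);
+
+	struct superblock_counts {
+	    uint64_t    fdblocks;    /* free data blocks */
+	    uint64_t    icount;      /* allocated inodes */
+	    uint64_t    ifree;       /* free inodes */
+	};
+
+	/* Recompute the summary counters and compare with the superblock. */
+	bool
+	xref_summary_counters(const struct superblock_counts *sb,
+	                      unsigned int agcount)
+	{
+	    uint64_t        fdblocks = 0, icount = 0, ifree = 0;
+	    unsigned int    agno;
+
+	    for (agno = 0; agno < agcount; agno++) {
+	        fdblocks += count_free_space(agno);
+	        icount += count_inodes(agno);
+	        ifree += count_free_inodes(agno);
+	    }
+	    return fdblocks == sb->fdblocks && icount == sb->icount &&
+	           ifree == sb->ifree;
+	}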
+
+:ref:`Quota usage <quotacheck>` and :ref:`file link count <nlinks>`
+checking are sufficiently complicated to warrant separate sections.
+
+Post-Repair Reverification
+``````````````````````````
+
+After performing a repair, the checking code is run a second time to validate
+the new structure, and the results of the health assessment are recorded
+internally and returned to the calling process.
+This step is critical for enabling system administrators to monitor the status
+of the filesystem and the progress of any repairs.
+For developers, it is a useful means to judge the efficacy of error detection
+and correction in the online and offline checking tools.
diff --git a/Documentation/filesystems/xfs-self-describing-metadata.rst b/Documentation/filesystems/xfs-self-describing-metadata.rst
index b79dbf36dc94..a10c4ae6955e 100644
--- a/Documentation/filesystems/xfs-self-describing-metadata.rst
+++ b/Documentation/filesystems/xfs-self-describing-metadata.rst
@@ -1,4 +1,5 @@ 
 .. SPDX-License-Identifier: GPL-2.0
+.. _xfs_self_describing_metadata:
 
 ============================
 XFS Self Describing Metadata