
docs/mm: add VMA locks documentation

Message ID 20241107190137.58000-1-lorenzo.stoakes@oracle.com (mailing list archive)
State New
Series docs/mm: add VMA locks documentation

Commit Message

Lorenzo Stoakes Nov. 7, 2024, 7:01 p.m. UTC
Andrew - As mm-specific docs were brought under the mm tree in a recent
change to MAINTAINERS, I believe this ought to go through your tree?


Locking around VMAs is complicated and confusing. While we have a number of
disparate comments scattered around the place, we seem to be reaching a
level of complexity that justifies a serious effort at clearly documenting
how locks are expected to be used when it comes to interacting with
mm_struct and vm_area_struct objects.

This is especially pertinent given the efforts to find sensible
abstractions for these fundamental objects in kernel Rust code, whose
compiler strictly requires some means of expressing these rules (and,
through this expression, self-documenting these requirements as well as
enforcing them).

The document limits scope to mmap and VMA locks and those that are
immediately adjacent and relevant to them - so it additionally covers page
table locking, as this is very closely tied to VMA operations (and relies
upon us handling these correctly).

The document tries to cover some of the nastier and more confusing edge
cases and concerns especially around lock ordering and page table teardown.

The document is split between generally useful information for users of mm
interfaces, and separately a section intended for mm kernel developers
providing a discussion around internal implementation details.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---

REVIEWERS NOTES:
* Apologies if I missed any feedback; I believe I have taken everything
  into account, but do let me know if I missed anything.
* As before, for convenience, I've uploaded a render of this document to my
  website at https://ljs.io/output/mm/process_addrs
* You can speed up doc builds by running `make SPHINXDIRS=mm htmldocs`.

v1:
* Removed RFC tag as I think we are iterating towards something workable
  and there is interest.
* Cleaned up and sharpened the language, structure and layout. Separated
  into top-level details and implementation sections as per Alice.
* Replaced links with rather more readable formatting.
* Improved valid mmap/VMA lock state table.
* Put VMA locks section into the process addresses document as per SJ and
  Mike.
* Made clear that read/write operations are against the VMA object rather
  than userland memory, as per Mike's suggestion, and also that this does
  not refer to page tables, as per Jann.
* Moved note into main section as per Mike's suggestion.
* Fixed grammar mistake as per Mike.
* Converted list-table to table as per Mike.
* Corrected various typos as per Jann, Suren.
* Updated reference to page fault arches as per Jann.
* Corrected mistaken write lock criteria for vm_lock_seq as per Jann.
* Updated vm_pgoff description to reference CONFIG_ARCH_HAS_PTE_SPECIAL as
  per Jann.
* Updated write lock to mmap read for vma->numab_state as per Jann.
* Clarified that the write lock is on the mmap and VMA lock at VMA
  granularity earlier in description as per Suren.
* Added explicit note at top of VMA lock section to explicitly highlight
  VMA lock semantics as per Suren.
* Updated required locking for vma lock fields to N/A to avoid confusion as
  per Suren.
* Corrected description of mmap_downgrade() as per Suren.
* Added a note on gate VMAs as per Jann.
* Explained that taking mmap read lock under VMA lock is a bad idea due to
  deadlock as per Jann.
* Discussed atomicity in page table operations as per Jann.
* Adapted the well thought out page table locking rules as provided by Jann.
* Added a comment about pte mapping maintaining an RCU read lock.
* Added clarification on moving page tables as informed by Jann's comments
  (though it turns out mremap() doesn't necessarily hold all locks if it
  can resolve races in other ways :)
* Added Jann's diagram showing lock exclusivity characteristics.

RFC:
https://lore.kernel.org/all/20241101185033.131880-1-lorenzo.stoakes@oracle.com/

 Documentation/mm/process_addrs.rst | 678 +++++++++++++++++++++++++++++
 1 file changed, 678 insertions(+)

--
2.47.0

Comments

Jann Horn Nov. 7, 2024, 9:15 p.m. UTC | #1
On Thu, Nov 7, 2024 at 8:02 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> Locking around VMAs is complicated and confusing. While we have a number of
> disparate comments scattered around the place, we seem to be reaching a
> level of complexity that justifies a serious effort at clearly documenting
> how locks are expected to be used when it comes to interacting with
> mm_struct and vm_area_struct objects.

Thanks, I think this is looking pretty good now.

> +VMA fields
> +^^^^^^^^^^
> +
> +We can subdivide :c:struct:`!struct vm_area_struct` fields by their purpose, which makes it
> +easier to explore their locking characteristics:
> +
> +.. note:: We exclude VMA lock-specific fields here to avoid confusion, as these
> +         are in effect an internal implementation detail.
> +
> +.. table:: Virtual layout fields
> +
> +   ===================== ======================================== ===========
> +   Field                 Description                              Write lock
> +   ===================== ======================================== ===========
[...]
> +   :c:member:`!vm_pgoff` Describes the page offset into the file, rmap write.
> +                         the original page offset within the      mmap write,
> +                         virtual address space (prior to any      rmap write.
> +                         :c:func:`!mremap`), or PFN if a PFN map
> +                         and the architecture does not support
> +                         :c:macro:`!CONFIG_ARCH_HAS_PTE_SPECIAL`.

Is that a typo in the rightmost column? "rmap write. mmap write, rmap
write." lists rmap twice

> +   ===================== ======================================== ===========
> +
> +These fields describes the size, start and end of the VMA, and as such cannot be

s/describes/describe/

> +modified without first being hidden from the reverse mapping since these fields
> +are used to locate VMAs within the reverse mapping interval trees.
[...]
> +.. table:: Reverse mapping fields
> +
> +   =================================== ========================================= ================
> +   Field                               Description                               Write lock
> +   =================================== ========================================= ================
> +   :c:member:`!shared.rb`              A red/black tree node used, if the        mmap write,
> +                                       mapping is file-backed, to place the VMA  VMA write,
> +                                       in the                                    i_mmap write.
> +                                       :c:member:`!struct address_space->i_mmap`
> +                                      red/black interval tree.

This list of write locks is correct regarding what locks you need to
make changes to the VMA's tree membership. Technically at a lower
level, the contents of vma->shared.rb are written while holding only
the file rmap lock when the surrounding rbtree nodes change, but
that's because the rbtree basically takes ownership of the node once
it's been inserted into the tree. But I don't know how to concisely
express that here, and it's kind of a nitpicky detail, so I think the
current version looks good.

Maybe you could add "For changing the VMA's association with the
rbtree:" on top of the list of write locks for this one?

> +   :c:member:`!shared.rb_subtree_last` Metadata used for management of the
> +                                       interval tree if the VMA is file-backed.  mmap write,
> +                                                                                 VMA write,
> +                                                                                 i_mmap write.

For this one, I think it would be more accurate to say that it is
written under just the i_mmap write lock. Though maybe that looks
kinda inconsistent with the previous one...

> +   :c:member:`!anon_vma_chain`         List of links to forked/CoW’d anon_vma    mmap read,
> +                                       objects.                                  anon_vma write.

Technically not just forked/CoW'd ones, but also the current one.
Maybe something like "List of links to anon_vma objects (including
inherited ones) that anonymous pages can be associated with"?

> +   :c:member:`!anon_vma`               :c:type:`!anon_vma` object used by        mmap_read,
> +                                       anonymous folios mapped exclusively to    page_table_lock.
> +                                      this VMA.

move_vma() uses unlink_anon_vmas() to change ->anon_vma from non-NULL
to NULL. There we hold:

 - mmap lock (exclusive, from sys_mremap)
 - VMA write lock (from move_vma)
 - anon_vma lock (from unlink_anon_vmas)

So it's not true that we always hold the page_table_lock for this.

Should this maybe have two separate parts, one for "for changing NULL
-> non-NULL" and one for "changing non-NULL to NULL"? Where the
NULL->non-NULL scenario uses the locks you listed and non-NULL->NULL
relies on write-locking the VMA and the anon_vma?

> +   =================================== ========================================= ================
> +
> +These fields are used to both place the VMA within the reverse mapping, and for
> +anonymous mappings, to be able to access both related :c:struct:`!struct anon_vma` objects
> +and the :c:struct:`!struct anon_vma` which folios mapped exclusively to this VMA should

typo: s/which folios/in which folios/

> +reside.
> +
> +Page tables
> +-----------
> +
> +We won't speak exhaustively on the subject but broadly speaking, page tables map
> +virtual addresses to physical ones through a series of page tables, each of
> +which contain entries with physical addresses for the next page table level
> +(along with flags), and at the leaf level the physical addresses of the
> +underlying physical data pages (with offsets into these pages provided by the
> +virtual address itself).
> +
> +In Linux these are divided into five levels - PGD, P4D, PUD, PMD and PTE. Huge
> +pages might eliminate one or two of these levels, but when this is the case we
> +typically refer to the leaf level as the PTE level regardless.

(That last sentence doesn't match my headcanon but I also don't have
any reference for what is common Linux kernel phrasing around this so
this isn't really an actionable comment.)

> +.. note:: In instances where the architecture supports fewer page tables than
> +   five the kernel cleverly 'folds' page table levels, that is skips them within
> +   the logic, regardless we can act as if there were always five.
> +
> +There are three key operations typically performed on page tables:
> +
> +1. **Installing** page table mappings - whether creating a new mapping or
> +   modifying an existing one.
> +2. **Zapping/unmapping** page tables - This is what the kernel calls clearing page

bikeshedding, feel free to ignore:
Maybe "Zapping/unmapping page table entries"? At least that's how I
always read "zap_pte_range()" in my head - "zap page table entry
range". Though I don't know if that's the canonical interpretation.

> +   table mappings at the leaf level only, whilst leaving all page tables in
> +   place. This is a very common operation in the kernel performed on file
> +   truncation, the :c:macro:`!MADV_DONTNEED` operation via :c:func:`!madvise`,
> +   and others. This is performed by a number of functions including
> +   :c:func:`!unmap_mapping_range`, :c:func:`!unmap_mapping_pages` and reverse
> +   mapping logic.
[...]
> +Locking rules
> +^^^^^^^^^^^^^
> +
> +We establish basic locking rules when interacting with page tables:
> +
> +* When changing a page table entry the page table lock for that page table
> +  **must** be held.

(except, as you described below, in free_pgtables() when changing page
table entries pointing to lower-level page tables)

> +* Reads from and writes to page table entries must be appropriately atomic. See
> +  the section on atomicity below.
[...]
> +Page table installation
> +^^^^^^^^^^^^^^^^^^^^^^^
> +
> +When allocating a P4D, PUD or PMD and setting the relevant entry in the above
> +PGD, P4D or PUD, the :c:member:`!mm->page_table_lock` must be held. This is
> +acquired in :c:func:`!__p4d_alloc`, :c:func:`!__pud_alloc` and
> +:c:func:`!__pmd_alloc` respectively.
> +
> +.. note:: :c:func:`!__pmd_alloc` actually invokes :c:func:`!pud_lock` and
> +   :c:func:`!pud_lockptr` in turn, however at the time of writing it ultimately
> +   references the :c:member:`!mm->page_table_lock`.
> +
> +Allocating a PTE will either use the :c:member:`!mm->page_table_lock` or, if
> +:c:macro:`!USE_SPLIT_PMD_PTLOCKS` is defined, used a lock embedded in the PMD

typo: s/used/use/

> +physical page metadata in the form of a :c:struct:`!struct ptdesc`, acquired by
> +:c:func:`!pmd_ptdesc` called from :c:func:`!pmd_lock` and ultimately
> +:c:func:`!__pte_alloc`.
> +
> +Finally, modifying the contents of the PTE has special treatment, as this is a

nit: unclear what "this" refers to here - it looks like it refers to
"the PTE", but "the PTE is a lock" wouldn't make grammatical sense

> +lock that we must acquire whenever we want stable and exclusive access to
> +entries pointing to data pages within a PTE, especially when we wish to modify
> +them.

I don't think "entries pointing to data pages" need any more locking
than other entries, like swap entries or migration markers?

> +This is performed via :c:func:`!pte_offset_map_lock` which carefully checks to
> +ensure that the PTE hasn't changed from under us, ultimately invoking
> +:c:func:`!pte_lockptr` to obtain a spin lock at PTE granularity contained within
> +the :c:struct:`!struct ptdesc` associated with the physical PTE page. The lock
> +must be released via :c:func:`!pte_unmap_unlock`.
> +
> +.. note:: There are some variants on this, such as
> +   :c:func:`!pte_offset_map_rw_nolock` when we know we hold the PTE stable but
> +   for brevity we do not explore this.  See the comment for
> +   :c:func:`!__pte_offset_map_lock` for more details.
> +
> +When modifying data in ranges we typically only wish to allocate higher page
> +tables as necessary, using these locks to avoid races or overwriting anything,
> +and set/clear data at the PTE level as required (for instance when page faulting
> +or zapping).
[...]
> +Page table moving
> +^^^^^^^^^^^^^^^^^
> +
> +Some functions manipulate page table levels above PMD (that is PUD, P4D and PGD
> +page tables). Most notable of these is :c:func:`!mremap`, which is capable of
> +moving higher level page tables.
> +
> +In these instances, it is either required that **all** locks are taken, that is
> +the mmap lock, the VMA lock and the relevant rmap lock, or that the mmap lock
> +and VMA locks are taken and some other measure is taken to avoid rmap races (see
> +the comment in :c:func:`!move_ptes` in the :c:func:`!mremap` implementation for
> +details of how this is handled in this instance).

mremap() always takes the rmap locks when moving entire page tables,
and AFAIK that is necessary to avoid races that lead to TLB flushes
going to the wrong address. mremap() sometimes moves *leaf entries*
without holding rmap locks, but never entire tables.

move_pgt_entry() is confusingly written - need_rmap_locks is actually
always true in the NORMAL_* cases that move non-leaf entries.

> +You can observe that in the :c:func:`!mremap` implementation in the functions
> +:c:func:`!take_rmap_locks` and :c:func:`!drop_rmap_locks` which perform the rmap
> +side of lock acquisition, invoked ultimately by :c:func:`!move_page_tables`.
> +
> +VMA lock internals
> +------------------
> +
> +This kind of locking is entirely optimistic - if the lock is contended or a
> +competing write has started, then we do not obtain a read lock.
> +
> +The :c:func:`!lock_vma_under_rcu` function first calls :c:func:`!rcu_read_lock`
> +to ensure that the VMA is acquired in an RCU critical section, then attempts to

Maybe s/is acquired in/is looked up in/, to make it clearer that
you're talking about a VMA lookup?

> +VMA lock it via :c:func:`!vma_start_read`, before releasing the RCU lock via
> +:c:func:`!rcu_read_unlock`.
> +
> +VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
> +their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
> +via :c:func:`!vma_end_read`.
> +
> +VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances where a
> +VMA is about to be modified, unlike :c:func:`!vma_start_read` the lock is always
> +acquired. An mmap write lock **must** be held for the duration of the VMA write
> +lock, releasing or downgrading the mmap write lock also releases the VMA write
> +lock so there is no :c:func:`!vma_end_write` function.
> +
> +Note that a semaphore write lock is not held across a VMA lock. Rather, a
> +sequence number is used for serialisation, and the write semaphore is only
> +acquired at the point of write lock to update this.
> +
> +This ensures the semantics we require - VMA write locks provide exclusive write
> +access to the VMA.
> +
> +The VMA lock mechanism is designed to be a lightweight means of avoiding the use
> +of the heavily contended mmap lock. It is implemented using a combination of a
> +read/write semaphore and sequence numbers belonging to the containing
> +:c:struct:`!struct mm_struct` and the VMA.
> +
> +Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
> +operation, i.e. it tries to acquire a read lock but returns false if it is
> +unable to do so. At the end of the read operation, :c:func:`!vma_end_read` is
> +called to release the VMA read lock. This can be done under RCU alone.

Please clarify what "This" refers to, and whether the part about RCU
is explaining an implementation detail or the API contract.

> +
> +Writing requires the mmap to be write-locked and the VMA lock to be acquired via
> +:c:func:`!vma_start_write`, however the write lock is released by the termination or
> +downgrade of the mmap write lock so no :c:func:`!vma_end_write` is required.
> +
> +All this is achieved by the use of per-mm and per-VMA sequence counts, which are
> +used in order to reduce complexity, especially for operations which write-lock
> +multiple VMAs at once.
> +
> +If the mm sequence count, :c:member:`!mm->mm_lock_seq` is equal to the VMA
> +sequence count :c:member:`!vma->vm_lock_seq` then the VMA is write-locked. If
> +they differ, then they are not.

nit: "it is not"?

> +
> +Each time an mmap write lock is acquired in :c:func:`!mmap_write_lock`,
> +:c:func:`!mmap_write_lock_nested`, :c:func:`!mmap_write_lock_killable`, the
> +:c:member:`!mm->mm_lock_seq` sequence number is incremented via
> +:c:func:`!mm_lock_seqcount_begin`.
> +
> +Each time the mmap write lock is released in :c:func:`!mmap_write_unlock` or
> +:c:func:`!mmap_write_downgrade`, :c:func:`!vma_end_write_all` is invoked which
> +also increments :c:member:`!mm->mm_lock_seq` via
> +:c:func:`!mm_lock_seqcount_end`.
> +
> +This way, we ensure regardless of the VMA's sequence number count, that a write
> +lock is not incorrectly indicated (since we increment the sequence counter on
> +acquiring the mmap write lock, which is required in order to obtain a VMA write
> +lock), and that when we release an mmap write lock, we efficiently release
> +**all** VMA write locks contained within the mmap at the same time.

Incrementing on mmap_write_lock() is not necessary for VMA locks; that
part is for future seqlock-style users of the MM sequence count that
want to work without even taking the VMA lock, with the new
mmap_lock_speculation_{begin,end} API. See commit db8f33d3b7698 and
the thread https://lore.kernel.org/linux-mm/20241010205644.3831427-5-andrii@kernel.org/.
Kirill A. Shutemov Nov. 8, 2024, 8:26 a.m. UTC | #2
On Thu, Nov 07, 2024 at 07:01:37PM +0000, Lorenzo Stoakes wrote:
> +.. table:: Config-specific fields
> +
> +   ================================= ===================== ======================================== ===============
> +   Field                             Configuration option  Description                              Write lock
> +   ================================= ===================== ======================================== ===============
> +   :c:member:`!anon_name`            CONFIG_ANON_VMA_NAME  A field for storing a                    mmap write,
> +                                                           :c:struct:`!struct anon_vma_name`        VMA write.
> +                                                           object providing a name for anonymous
> +                                                           mappings, or :c:macro:`!NULL` if none
> +							   is set or the VMA is file-backed.
> +   :c:member:`!swap_readahead_info`  CONFIG_SWAP           Metadata used by the swap mechanism      mmap read.
> +                                                           to perform readahead.

It is not clear how writes to this field are serialized by a shared lock.

It is worth noting that it is atomic.

> +   :c:member:`!vm_policy`            CONFIG_NUMA           :c:type:`!mempolicy` object which        mmap write,
> +                                                           describes the NUMA behaviour of the      VMA write.
> +							   VMA.
> +   :c:member:`!numab_state`          CONFIG_NUMA_BALANCING :c:type:`!vma_numab_state` object which  mmap read.
> +                                                           describes the current state of
> +                                                           NUMA balancing in relation to this VMA.
> +                                                           Updated under mmap read lock by
> +							   :c:func:`!task_numa_work`.

Again, a shared lock serializing writes makes zero sense. There's some other
mechanism in play.

I believe there's some kind of scheduler logic that excludes parallel
updates for the same process. But I cannot say I understand this.

> +   :c:member:`!vm_userfaultfd_ctx`   CONFIG_USERFAULTFD    Userfaultfd context wrapper object of    mmap write,
> +                                                           type :c:type:`!vm_userfaultfd_ctx`,      VMA write.
> +                                                           either of zero size if userfaultfd is
> +                                                           disabled, or containing a pointer
> +                                                           to an underlying
> +							   :c:type:`!userfaultfd_ctx` object which
> +                                                           describes userfaultfd metadata.
> +   ================================= ===================== ======================================== ===============

...

> +Lock ordering
> +-------------
> +
> +As we have multiple locks across the kernel which may or may not be taken at the
> +same time as explicit mm or VMA locks, we have to be wary of lock inversion, and
> +the **order** in which locks are acquired and released becomes very important.
> +
> +.. note:: Lock inversion occurs when two threads need to acquire multiple locks,
> +   but in doing so inadvertently cause a mutual deadlock.
> +
> +   For example, consider thread 1 which holds lock A and tries to acquire lock B,
> +   while thread 2 holds lock B and tries to acquire lock A.
> +
> +   Both threads are now deadlocked on each other. However, had they attempted to
> +   acquire locks in the same order, one would have waited for the other to
> +   complete its work and no deadlock would have occurred.
> +
> +The opening comment in `mm/rmap.c` describes in detail the required ordering of
> +locks within memory management code:
> +
> +.. code-block::
> +
> +  inode->i_rwsem	(while writing or truncating, not reading or faulting)
> +    mm->mmap_lock
> +      mapping->invalidate_lock (in filemap_fault)
> +        folio_lock
> +          hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
> +            vma_start_write
> +              mapping->i_mmap_rwsem
> +                anon_vma->rwsem
> +                  mm->page_table_lock or pte_lock
> +                    swap_lock (in swap_duplicate, swap_info_get)
> +                      mmlist_lock (in mmput, drain_mmlist and others)
> +                      mapping->private_lock (in block_dirty_folio)
> +                          i_pages lock (widely used)
> +                            lruvec->lru_lock (in folio_lruvec_lock_irq)
> +                      inode->i_lock (in set_page_dirty's __mark_inode_dirty)
> +                      bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
> +                        sb_lock (within inode_lock in fs/fs-writeback.c)
> +                        i_pages lock (widely used, in set_page_dirty,
> +                                  in arch-dependent flush_dcache_mmap_lock,
> +                                  within bdi.wb->list_lock in __sync_single_inode)
> +
> +Please check the current state of this comment which may have changed since the
> +time of writing of this document.

I think we need one canonical place for this information. Maybe it's worth
moving it here from rmap.c? There's more lock ordering info in filemap.c.
Lorenzo Stoakes Nov. 8, 2024, 8:59 a.m. UTC | #3
On Fri, Nov 08, 2024 at 10:26:03AM +0200, Kirill A. Shutemov wrote:
> On Thu, Nov 07, 2024 at 07:01:37PM +0000, Lorenzo Stoakes wrote:
> > +.. table:: Config-specific fields
> > +
> > +   ================================= ===================== ======================================== ===============
> > +   Field                             Configuration option  Description                              Write lock
> > +   ================================= ===================== ======================================== ===============
> > +   :c:member:`!anon_name`            CONFIG_ANON_VMA_NAME  A field for storing a                    mmap write,
> > +                                                           :c:struct:`!struct anon_vma_name`        VMA write.
> > +                                                           object providing a name for anonymous
> > +                                                           mappings, or :c:macro:`!NULL` if none
> > +							   is set or the VMA is file-backed.
> > +   :c:member:`!swap_readahead_info`  CONFIG_SWAP           Metadata used by the swap mechanism      mmap read.
> > +                                                           to perform readahead.
>
> It is not clear how writes to this field are serialized by a shared lock.
>

Yes, I think there is a swap-specific lock, but I'm not sure it's worth
confusing matters by including that here; maybe here and for numab I will just
wave my hands for that bit.

> It is worth noting that it is atomic.
>

Will add.

> > +   :c:member:`!vm_policy`            CONFIG_NUMA           :c:type:`!mempolicy` object which        mmap write,
> > +                                                           describes the NUMA behaviour of the      VMA write.
> > +							   VMA.
> > +   :c:member:`!numab_state`          CONFIG_NUMA_BALANCING :c:type:`!vma_numab_state` object which  mmap read.
> > +                                                           describes the current state of
> > +                                                           NUMA balancing in relation to this VMA.
> > +                                                           Updated under mmap read lock by
> > +							   :c:func:`!task_numa_work`.
>
> Again, a shared lock serializing writes makes zero sense. There's some other
> mechanism in play.
>
> I believe there's some kind of scheduler logic that excludes parallel
> updates for the same process. But I cannot say I understand this.

Ack, agreed, see above, hand waving probably required :)

>
> > +   :c:member:`!vm_userfaultfd_ctx`   CONFIG_USERFAULTFD    Userfaultfd context wrapper object of    mmap write,
> > +                                                           type :c:type:`!vm_userfaultfd_ctx`,      VMA write.
> > +                                                           either of zero size if userfaultfd is
> > +                                                           disabled, or containing a pointer
> > +                                                           to an underlying
> > +							   :c:type:`!userfaultfd_ctx` object which
> > +                                                           describes userfaultfd metadata.
> > +   ================================= ===================== ======================================== ===============
>
> ...
>
> > +Lock ordering
> > +-------------
> > +
> > +As we have multiple locks across the kernel which may or may not be taken at the
> > +same time as explicit mm or VMA locks, we have to be wary of lock inversion, and
> > +the **order** in which locks are acquired and released becomes very important.
> > +
> > +.. note:: Lock inversion occurs when two threads need to acquire multiple locks,
> > +   but in doing so inadvertently cause a mutual deadlock.
> > +
> > +   For example, consider thread 1 which holds lock A and tries to acquire lock B,
> > +   while thread 2 holds lock B and tries to acquire lock A.
> > +
> > +   Both threads are now deadlocked on each other. However, had they attempted to
> > +   acquire locks in the same order, one would have waited for the other to
> > +   complete its work and no deadlock would have occurred.
> > +
> > +The opening comment in `mm/rmap.c` describes in detail the required ordering of
> > +locks within memory management code:
> > +
> > +.. code-block::
> > +
> > +  inode->i_rwsem	(while writing or truncating, not reading or faulting)
> > +    mm->mmap_lock
> > +      mapping->invalidate_lock (in filemap_fault)
> > +        folio_lock
> > +          hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
> > +            vma_start_write
> > +              mapping->i_mmap_rwsem
> > +                anon_vma->rwsem
> > +                  mm->page_table_lock or pte_lock
> > +                    swap_lock (in swap_duplicate, swap_info_get)
> > +                      mmlist_lock (in mmput, drain_mmlist and others)
> > +                      mapping->private_lock (in block_dirty_folio)
> > +                          i_pages lock (widely used)
> > +                            lruvec->lru_lock (in folio_lruvec_lock_irq)
> > +                      inode->i_lock (in set_page_dirty's __mark_inode_dirty)
> > +                      bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
> > +                        sb_lock (within inode_lock in fs/fs-writeback.c)
> > +                        i_pages lock (widely used, in set_page_dirty,
> > +                                  in arch-dependent flush_dcache_mmap_lock,
> > +                                  within bdi.wb->list_lock in __sync_single_inode)
> > +
> > +Please check the current state of this comment which may have changed since the
> > +time of writing of this document.
>
> I think we need one canonical place for this information. Maybe it's worth
> moving it here from rmap.c? There's more lock ordering info in filemap.c.

Re: canonical place - yes I agree. Once this doc goes in, I can follow up with
a patch that replaces the comment with a link to the official kernel.org
latest docs, etc. That would also allow us to render this more nicely in
future, perhaps.

Re: mm/filemap.c - these are getting into filesystem-specific stuff, so I'm
not sure this is the right place, but there's a lot of overlap; maybe worth
importing anyway.

>
> --
>   Kiryl Shutsemau / Kirill A. Shutemov

Thanks!
Lorenzo Stoakes Nov. 8, 2024, 10:35 a.m. UTC | #4
On Thu, Nov 07, 2024 at 10:15:31PM +0100, Jann Horn wrote:
> On Thu, Nov 7, 2024 at 8:02 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > Locking around VMAs is complicated and confusing. While we have a number of
> > disparate comments scattered around the place, we seem to be reaching a
> > level of complexity that justifies a serious effort at clearly documenting
> > how locks are expected to be used when it comes to interacting with
> > mm_struct and vm_area_struct objects.
>
> Thanks, I think this is looking pretty good now.

Thanks! I think we're iterating towards the final product bit by bit.

>
> > +VMA fields
> > +^^^^^^^^^^
> > +
> > +We can subdivide :c:struct:`!struct vm_area_struct` fields by their purpose, which makes it
> > +easier to explore their locking characteristics:
> > +
> > +.. note:: We exclude VMA lock-specific fields here to avoid confusion, as these
> > +         are in effect an internal implementation detail.
> > +
> > +.. table:: Virtual layout fields
> > +
> > +   ===================== ======================================== ===========
> > +   Field                 Description                              Write lock
> > +   ===================== ======================================== ===========
> [...]
> > +   :c:member:`!vm_pgoff` Describes the page offset into the file, rmap write.
> > +                         the original page offset within the      mmap write,
> > +                         virtual address space (prior to any      rmap write.
> > +                         :c:func:`!mremap`), or PFN if a PFN map
> > +                         and the architecture does not support
> > +                         :c:macro:`!CONFIG_ARCH_HAS_PTE_SPECIAL`.
>
> Is that a typo in the rightmost column? "rmap write. mmap write, rmap
> write." lists rmap twice

Yep, I noticed this very shortly after sending the patch, have fixed for v2.

>
> > +   ===================== ======================================== ===========
> > +
> > +These fields describes the size, start and end of the VMA, and as such cannot be
>
> s/describes/describe/
>
> > +modified without first being hidden from the reverse mapping since these fields
> > +are used to locate VMAs within the reverse mapping interval trees.
> [...]
> > +.. table:: Reverse mapping fields
> > +
> > +   =================================== ========================================= ================
> > +   Field                               Description                               Write lock
> > +   =================================== ========================================= ================
> > +   :c:member:`!shared.rb`              A red/black tree node used, if the        mmap write,
> > +                                       mapping is file-backed, to place the VMA  VMA write,
> > +                                       in the                                    i_mmap write.
> > +                                       :c:member:`!struct address_space->i_mmap`
> > +                                      red/black interval tree.
>
> This list of write locks is correct regarding what locks you need to
> make changes to the VMA's tree membership. Technically at a lower
> level, the contents of vma->shared.rb are written while holding only
> the file rmap lock when the surrounding rbtree nodes change, but
> that's because the rbtree basically takes ownership of the node once
> it's been inserted into the tree. But I don't know how to concisely
> express that here, and it's kind of a nitpicky detail, so I think the
> current version looks good.

Yeah, I think we have to limit how far we go to keep this readable :)

>
> Maybe you could add "For changing the VMA's association with the
> rbtree:" on top of the list of write locks for this one?
>
> > +   :c:member:`!shared.rb_subtree_last` Metadata used for management of the
> > +                                       interval tree if the VMA is file-backed.  mmap write,
> > +                                                                                 VMA write,
> > +                                                                                 i_mmap write.
>
> For this one, I think it would be more accurate to say that it is
> written under just the i_mmap write lock. Though maybe that looks
> kinda inconsistent with the previous one...

But I think you probably have to hold the others to make the change anyway,
so I think we're good to hand-wave a bit here.

There's an argument for not even listing these fields, as they're an impl
detail (as I decided to do with the VMA lock fields), but I kept them to be
consistent with anon_vma, which we really do _have_ to talk about.

>
> > +   :c:member:`!anon_vma_chain`         List of links to forked/CoW’d anon_vma    mmap read,
> > +                                       objects.                                  anon_vma write.
>
> Technically not just forked/CoW'd ones, but also the current one.
> Maybe something like "List of links to anon_vma objects (including
> inherited ones) that anonymous pages can be associated with"?

Yeah, true. I thought it was important to emphasise this, as the one you will
use for exclusively mapped folios lives in ->anon_vma, but we should mention
the current one too - will update to be more accurate.

I don't like the 'related anon_vma's or the vagueness of 'associated with'
so I think better to say 'forked/CoW'd anon_vma objects and vma->anon_vma
if non-NULL'.

The whole topic of anon_vma's is a fraught mess (hey I did some nice
diagrams describing it in the book though :P) so want to be as specific as
I can here.

>
> > +   :c:member:`!anon_vma`               :c:type:`!anon_vma` object used by        mmap_read,
> > +                                       anonymous folios mapped exclusively to    page_table_lock.
> > +                                      this VMA.
>
> move_vma() uses unlink_anon_vmas() to change ->anon_vma from non-NULL
> to NULL. There we hold:
>
>  - mmap lock (exclusive, from sys_mremap)
>  - VMA write lock (from move_vma)
>  - anon_vma lock (from unlink_anon_vmas)
>
> So it's not true that we always hold the page_table_lock for this.
>
> Should this maybe have two separate parts, one for "for changing NULL
> -> non-NULL" and one for "changing non-NULL to NULL"? Where the
> NULL->non-NULL scenario uses the locks you listed and non-NULL->NULL
> relies on write-locking the VMA and the anon_vma?

Yeah, there are some annoying inconsistencies. I sort of meant this as 'at
minimum', thinking the anon_vma lock might be implicit once set, but that's
silly - you're right, I will be explicit here and update as you say.

Have added some more details on anon_vma_prepare() etc here too.

>
> > +   =================================== ========================================= ================
> > +
> > +These fields are used to both place the VMA within the reverse mapping, and for
> > +anonymous mappings, to be able to access both related :c:struct:`!struct anon_vma` objects
> > +and the :c:struct:`!struct anon_vma` which folios mapped exclusively to this VMA should
>
> typo: s/which folios/in which folios/

Ack, fixed.

>
> > +reside.
> > +
> > +Page tables
> > +-----------
> > +
> > +We won't speak exhaustively on the subject but broadly speaking, page tables map
> > +virtual addresses to physical ones through a series of page tables, each of
> > +which contain entries with physical addresses for the next page table level
> > +(along with flags), and at the leaf level the physical addresses of the
> > +underlying physical data pages (with offsets into these pages provided by the
> > +virtual address itself).
> > +
> > +In Linux these are divided into five levels - PGD, P4D, PUD, PMD and PTE. Huge
> > +pages might eliminate one or two of these levels, but when this is the case we
> > +typically refer to the leaf level as the PTE level regardless.
>
> (That last sentence doesn't match my headcanon but I also don't have
> any reference for what is common Linux kernel phrasing around this so
> this isn't really an actionable comment.)

Does it match your head directory/entry? ;)

>
> > +.. note:: In instances where the architecture supports fewer page tables than
> > +   five the kernel cleverly 'folds' page table levels, that is skips them within
> > +   the logic, regardless we can act as if there were always five.
> > +
> > +There are three key operations typically performed on page tables:
> > +
> > +1. **Installing** page table mappings - whether creating a new mapping or
> > +   modifying an existing one.
> > +2. **Zapping/unmapping** page tables - This is what the kernel calls clearing page
>
> bikeshedding, feel free to ignore:
> Maybe "Zapping/unmapping page table entries"? At least that's how I
> always read "zap_pte_range()" in my head - "zap page table entry
> range". Though I don't know if that's the canonical interpretation.

Yeah that's a good idea actually, will update.

>
> > +   table mappings at the leaf level only, whilst leaving all page tables in
> > +   place. This is a very common operation in the kernel performed on file
> > +   truncation, the :c:macro:`!MADV_DONTNEED` operation via :c:func:`!madvise`,
> > +   and others. This is performed by a number of functions including
> > +   :c:func:`!unmap_mapping_range`, :c:func:`!unmap_mapping_pages` and reverse
> > +   mapping logic.
> [...]
> > +Locking rules
> > +^^^^^^^^^^^^^
> > +
> > +We establish basic locking rules when interacting with page tables:
> > +
> > +* When changing a page table entry the page table lock for that page table
> > +  **must** be held.
>
> (except, as you described below, in free_pgtables() when changing page
> table entries pointing to lower-level page tables)

Ack will spell that out.

>
> > +* Reads from and writes to page table entries must be appropriately atomic. See
> > +  the section on atomicity below.
> [...]
> > +Page table installation
> > +^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +When allocating a P4D, PUD or PMD and setting the relevant entry in the above
> > +PGD, P4D or PUD, the :c:member:`!mm->page_table_lock` must be held. This is
> > +acquired in :c:func:`!__p4d_alloc`, :c:func:`!__pud_alloc` and
> > +:c:func:`!__pmd_alloc` respectively.
> > +
> > +.. note:: :c:func:`!__pmd_alloc` actually invokes :c:func:`!pud_lock` and
> > +   :c:func:`!pud_lockptr` in turn, however at the time of writing it ultimately
> > +   references the :c:member:`!mm->page_table_lock`.
> > +
> > +Allocating a PTE will either use the :c:member:`!mm->page_table_lock` or, if
> > +:c:macro:`!USE_SPLIT_PMD_PTLOCKS` is defined, used a lock embedded in the PMD
>
> typo: s/used/use/

Ack but probably better as s/used// here I think :>) Fixed.

>
> > +physical page metadata in the form of a :c:struct:`!struct ptdesc`, acquired by
> > +:c:func:`!pmd_ptdesc` called from :c:func:`!pmd_lock` and ultimately
> > +:c:func:`!__pte_alloc`.
> > +
> > +Finally, modifying the contents of the PTE has special treatment, as this is a
>
> nit: unclear what "this" refers to here - it looks like it refers to
> "the PTE", but "the PTE is a lock" wouldn't make grammatical sense

The PTE lock, have fixed.

>
> > +lock that we must acquire whenever we want stable and exclusive access to
> > +entries pointing to data pages within a PTE, especially when we wish to modify
> > +them.
>
> I don't think "entries pointing to data pages" need any more locking
> than other entries, like swap entries or migration markers?

Ack updated.

>
> > +This is performed via :c:func:`!pte_offset_map_lock` which carefully checks to
> > +ensure that the PTE hasn't changed from under us, ultimately invoking
> > +:c:func:`!pte_lockptr` to obtain a spin lock at PTE granularity contained within
> > +the :c:struct:`!struct ptdesc` associated with the physical PTE page. The lock
> > +must be released via :c:func:`!pte_unmap_unlock`.
> > +
> > +.. note:: There are some variants on this, such as
> > +   :c:func:`!pte_offset_map_rw_nolock` when we know we hold the PTE stable but
> > +   for brevity we do not explore this.  See the comment for
> > +   :c:func:`!__pte_offset_map_lock` for more details.
> > +
> > +When modifying data in ranges we typically only wish to allocate higher page
> > +tables as necessary, using these locks to avoid races or overwriting anything,
> > +and set/clear data at the PTE level as required (for instance when page faulting
> > +or zapping).
> [...]
> > +Page table moving
> > +^^^^^^^^^^^^^^^^^
> > +
> > +Some functions manipulate page table levels above PMD (that is PUD, P4D and PGD
> > +page tables). Most notable of these is :c:func:`!mremap`, which is capable of
> > +moving higher level page tables.
> > +
> > +In these instances, it is either required that **all** locks are taken, that is
> > +the mmap lock, the VMA lock and the relevant rmap lock, or that the mmap lock
> > +and VMA locks are taken and some other measure is taken to avoid rmap races (see
> > +the comment in :c:func:`!move_ptes` in the :c:func:`!mremap` implementation for
> > +details of how this is handled in this instance).
>
> mremap() always takes the rmap locks when moving entire page tables,
> and AFAIK that is necessary to avoid races that lead to TLB flushes
> going to the wrong address. mremap() sometimes moves *leaf entries*
> without holding rmap locks, but never entire tables.
>
> move_pgt_entry() is confusingly written - need_rmap_locks is actually
> always true in the NORMAL_* cases that move non-leaf entries.

OK you're right, this NORMAL_ vs HPAGE_ thing... ugh mremap() needs a TOTAL
REFACTOR. Another bit of churn for Lorenzo churn king Stoakes to get to at
some point :P

Have updated.

I eventually want to put as much as I possibly can into mm/vma.c so we can
make as many things userland unit testable as possible. But that's in the
future :)

>
> > +You can observe that in the :c:func:`!mremap` implementation in the functions
> > +:c:func:`!take_rmap_locks` and :c:func:`!drop_rmap_locks` which perform the rmap
> > +side of lock acquisition, invoked ultimately by :c:func:`!move_page_tables`.
> > +
> > +VMA lock internals
> > +------------------
> > +
> > +This kind of locking is entirely optimistic - if the lock is contended or a
> > +competing write has started, then we do not obtain a read lock.
> > +
> > +The :c:func:`!lock_vma_under_rcu` function first calls :c:func:`!rcu_read_lock`
> > +to ensure that the VMA is acquired in an RCU critical section, then attempts to
>
> Maybe s/is acquired in/is looked up in/, to make it clearer that
> you're talking about a VMA lookup?

Ack, this is why the maple tree is so critical here as it's
'RCU-friendly'. Have updated.

>
> > +VMA lock it via :c:func:`!vma_start_read`, before releasing the RCU lock via
> > +:c:func:`!rcu_read_unlock`.
> > +
> > +VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
> > +their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
> > +via :c:func:`!vma_end_read`.
> > +
> > +VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances where a
> > +VMA is about to be modified, unlike :c:func:`!vma_start_read` the lock is always
> > +acquired. An mmap write lock **must** be held for the duration of the VMA write
> > +lock, releasing or downgrading the mmap write lock also releases the VMA write
> > +lock so there is no :c:func:`!vma_end_write` function.
> > +
> > +Note that a semaphore write lock is not held across a VMA lock. Rather, a
> > +sequence number is used for serialisation, and the write semaphore is only
> > +acquired at the point of write lock to update this.
> > +
> > +This ensures the semantics we require - VMA write locks provide exclusive write
> > +access to the VMA.
> > +
> > +The VMA lock mechanism is designed to be a lightweight means of avoiding the use
> > +of the heavily contended mmap lock. It is implemented using a combination of a
> > +read/write semaphore and sequence numbers belonging to the containing
> > +:c:struct:`!struct mm_struct` and the VMA.
> > +
> > +Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
> > +operation, i.e. it tries to acquire a read lock but returns false if it is
> > +unable to do so. At the end of the read operation, :c:func:`!vma_end_read` is
> > +called to release the VMA read lock. This can be done under RCU alone.
>
> Please clarify what "This" refers to, and whether the part about RCU
> is explaining an implementation detail or the API contract.

Ack have added a chonky comment on this.

>
> > +
> > +Writing requires the mmap to be write-locked and the VMA lock to be acquired via
> > +:c:func:`!vma_start_write`, however the write lock is released by the termination or
> > +downgrade of the mmap write lock so no :c:func:`!vma_end_write` is required.
> > +
> > +All this is achieved by the use of per-mm and per-VMA sequence counts, which are
> > +used in order to reduce complexity, especially for operations which write-lock
> > +multiple VMAs at once.
> > +
> > +If the mm sequence count, :c:member:`!mm->mm_lock_seq` is equal to the VMA
> > +sequence count :c:member:`!vma->vm_lock_seq` then the VMA is write-locked. If
> > +they differ, then they are not.
>
> nit: "it is not"?

Ack, fixed.

>
> > +
> > +Each time an mmap write lock is acquired in :c:func:`!mmap_write_lock`,
> > +:c:func:`!mmap_write_lock_nested`, :c:func:`!mmap_write_lock_killable`, the
> > +:c:member:`!mm->mm_lock_seq` sequence number is incremented via
> > +:c:func:`!mm_lock_seqcount_begin`.
> > +
> > +Each time the mmap write lock is released in :c:func:`!mmap_write_unlock` or
> > +:c:func:`!mmap_write_downgrade`, :c:func:`!vma_end_write_all` is invoked which
> > +also increments :c:member:`!mm->mm_lock_seq` via
> > +:c:func:`!mm_lock_seqcount_end`.
> > +
> > +This way, we ensure regardless of the VMA's sequence number count, that a write
> > +lock is not incorrectly indicated (since we increment the sequence counter on
> > +acquiring the mmap write lock, which is required in order to obtain a VMA write
> > +lock), and that when we release an mmap write lock, we efficiently release
> > +**all** VMA write locks contained within the mmap at the same time.
>
> Incrementing on mmap_write_lock() is not necessary for VMA locks; that
> part is for future seqlock-style users of the MM sequence count that
> want to work without even taking the VMA lock, with the new
> mmap_lock_speculation_{begin,end} API. See commit db8f33d3b7698 and
> the thread https://lore.kernel.org/linux-mm/20241010205644.3831427-5-andrii@kernel.org/
> .

Right, I was aware of that part but thought we'd want to increment anyway.
However, I suppose given you increment on lock _release_ it isn't necessary.
We will be extending this section (or... maybe Suren will? ;)
Bagas Sanjaya Nov. 8, 2024, 12:25 p.m. UTC | #5
On Thu, Nov 07, 2024 at 07:01:37PM +0000, Lorenzo Stoakes wrote:
> +.. note:: In instances where the architecture supports fewer page tables than
> +   five the kernel cleverly 'folds' page table levels, that is skips them within
> +   the logic, regardless we can act as if there were always five.

What are being skipped if e.g. we only have 3 or 4 page tables?

Confused...
Lorenzo Stoakes Nov. 8, 2024, 12:29 p.m. UTC | #6
On Fri, Nov 08, 2024 at 07:25:38PM +0700, Bagas Sanjaya wrote:
> On Thu, Nov 07, 2024 at 07:01:37PM +0000, Lorenzo Stoakes wrote:
> > +.. note:: In instances where the architecture supports fewer page tables than
> > +   five the kernel cleverly 'folds' page table levels, that is skips them within
> > +   the logic, regardless we can act as if there were always five.
>
> What are being skipped if e.g. we only have 3 or 4 page tables?
>
> Confused...

Page table levels, see [0].

This is typically achieved by stubbing functions out, etc. So in the code you
actually write, and _conceptually_, there are always five levels; it's only in
the final compile that things like p4d_present(), p4d_clear() etc. might be
squashed by the compiler into nothing.

I don't want to get too bogged down in the details of this in the doc, it's
more of an aside!

[0]: https://elixir.bootlin.com/linux/v6.11.6/source/include/asm-generic/pgtable-nopud.h#L9

>
> --
> An old man doll... just what I always wanted! - Clara
Lorenzo Stoakes Nov. 8, 2024, 12:34 p.m. UTC | #7
On Fri, Nov 08, 2024 at 12:29:15PM +0000, Lorenzo Stoakes wrote:
> On Fri, Nov 08, 2024 at 07:25:38PM +0700, Bagas Sanjaya wrote:
> > On Thu, Nov 07, 2024 at 07:01:37PM +0000, Lorenzo Stoakes wrote:
> > > +.. note:: In instances where the architecture supports fewer page tables than
> > > +   five the kernel cleverly 'folds' page table levels, that is skips them within
> > > +   the logic, regardless we can act as if there were always five.
> >
> > What are being skipped if e.g. we only have 3 or 4 page tables?
> >
> > Confused...
>
> Page table levels, see [0].
>
> Typically achieved through stubbing functions out, etc. So in the code you
> actually write and _conceptually_ there are five levels, only in the final
> compile you might have things like p4d_present(), p4d_clear() etc. squashed by
> the compiler into nothing.

I have updated this note to be a little clearer and to explicitly state
this. Thanks!

>
> I don't want to get too bogged down into the details of this in the doc, more of
> an aside!
>
> [0]: https://elixir.bootlin.com/linux/v6.11.6/source/include/asm-generic/pgtable-nopud.h#L9
>
> >
> > --
> > An old man doll... just what I always wanted! - Clara
Wei Yang Nov. 11, 2024, 2:42 a.m. UTC | #8
On Thu, Nov 07, 2024 at 07:01:37PM +0000, Lorenzo Stoakes wrote:
>Andrew - As mm-specific docs were brought under the mm tree in a recent
>change to MAINTAINERS I believe this ought to go through your tree?
>
>
>Locking around VMAs is complicated and confusing. While we have a number of
>disparate comments scattered around the place, we seem to be reaching a
>level of complexity that justifies a serious effort at clearly documenting
>how locks are expected to be used when it comes to interacting with
>mm_struct and vm_area_struct objects.
>
>This is especially pertinent as regards the efforts to find sensible
>abstractions for these fundamental objects in kernel rust code whose
>compiler strictly requires some means of expressing these rules (and
>through this expression, self-document these requirements as well as
>enforce them).
>
>The document limits scope to mmap and VMA locks and those that are
>immediately adjacent and relevant to them - so additionally covers page
>table locking as this is so very closely tied to VMA operations (and relies
>upon us handling these correctly).
>
>The document tries to cover some of the nastier and more confusing edge
>cases and concerns especially around lock ordering and page table teardown.
>
>The document is split between generally useful information for users of mm
>interfaces, and separately a section intended for mm kernel developers
>providing a discussion around internal implementation details.
>
>Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>---
>
>REVIEWERS NOTES:
>* Apologies if I missed any feedback, I believe I have taken everything
>  into account but do let me know if I missed anything.
>* As before, for convenience, I've uploaded a render of this document to my
>  website at https://ljs.io/output/mm/process_addrs
>* You can speed up doc builds by running `make SPHINXDIRS=mm htmldocs`.
>
>v1:
>* Removed RFC tag as I think we are iterating towards something workable
>  and there is interest.
>* Cleaned up and sharpened the language, structure and layout. Separated
>  into top-level details and implementation sections as per Alice.
>* Replaced links with rather more readable formatting.
>* Improved valid mmap/VMA lock state table.
>* Put VMA locks section into the process addresses document as per SJ and
>  Mike.
>* Made clear as to read/write operations against VMA object rather than
>  userland memory, as per Mike's suggestion, also that it does not refer to
>  page tables as per Jann.
>* Moved note into main section as per Mike's suggestion.
>* Fixed grammar mistake as per Mike.
>* Converted list-table to table as per Mike.
>* Corrected various typos as per Jann, Suren.
>* Updated reference to page fault arches as per Jann.
>* Corrected mistaken write lock criteria for vm_lock_seq as per Jann.
>* Updated vm_pgoff description to reference CONFIG_ARCH_HAS_PTE_SPECIAL as
>  per Jann.
>* Updated write lock to mmap read for vma->numab_state as per Jann.
>* Clarified that the write lock is on the mmap and VMA lock at VMA
>  granularity earlier in description as per Suren.
>* Added explicit note at top of VMA lock section to explicitly highlight
>  VMA lock semantics as per Suren.
>* Updated required locking for vma lock fields to N/A to avoid confusion as
>  per Suren.
>* Corrected description of mmap_downgrade() as per Suren.
>* Added a note on gate VMAs as per Jann.
>* Explained that taking mmap read lock under VMA lock is a bad idea due to
>  deadlock as per Jann.
>* Discussed atomicity in page table operations as per Jann.
>* Adapted the well thought out page table locking rules as provided by Jann.
>* Added a comment about pte mapping maintaining an RCU read lock.
>* Added clarification on moving page tables as informed by Jann's comments
>  (though it turns out mremap() doesn't necessarily hold all locks if it
>  can resolve races other ways :)
>* Added Jann's diagram showing lock exclusivity characteristics.
>
>RFC:
>https://lore.kernel.org/all/20241101185033.131880-1-lorenzo.stoakes@oracle.com/
>
> Documentation/mm/process_addrs.rst | 678 +++++++++++++++++++++++++++++
> 1 file changed, 678 insertions(+)
>
>diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
>index e8618fbc62c9..a01a7bcf39ff 100644
>--- a/Documentation/mm/process_addrs.rst
>+++ b/Documentation/mm/process_addrs.rst
>@@ -3,3 +3,681 @@
> =================
> Process Addresses
> =================
>+
>+.. toctree::
>+   :maxdepth: 3
>+
>+
>+Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
>+'VMA's of type :c:struct:`!struct vm_area_struct`.
>+
>+Each VMA describes a virtually contiguous memory range with identical
>+attributes, each of which described by a :c:struct:`!struct vm_area_struct`
>+object. Userland access outside of VMAs is invalid except in the case where an
>+adjacent stack VMA could be extended to contain the accessed address.
>+
>+All VMAs are contained within one and only one virtual address space, described
>+by a :c:struct:`!struct mm_struct` object which is referenced by all tasks (that is,
>+threads) which share the virtual address space. We refer to this as the
>+:c:struct:`!mm`.
>+
>+Each mm object contains a maple tree data structure which describes all VMAs
>+within the virtual address space.
>+
>+.. note:: An exception to this is the 'gate' VMA which is provided for
>+	  architectures which use :c:struct:`!vsyscall` and is a global static
>+	  object which does not belong to any specific mm.
>+
>+-------
>+Locking
>+-------
>+
>+The kernel is designed to be highly scalable against concurrent read operations
>+on VMA **metadata** so a complicated set of locks are required to ensure memory
>+corruption does not occur.
>+
>+.. note:: Locking VMAs for their metadata does not have any impact on the memory
>+	  they describe or the page tables that map them.
>+
>+Terminology
>+-----------
>+
>+* **mmap locks** - Each MM has a read/write semaphore `mmap_lock` which locks at
>+  a process address space granularity which can be acquired via
>+  :c:func:`!mmap_read_lock`, :c:func:`!mmap_write_lock` and variants.
>+* **VMA locks** - The VMA lock is at VMA granularity (of course) which behaves
>+  as a read/write semaphore in practice. A VMA read lock is obtained via
>+  :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a
>+  write lock via :c:func:`!vma_start_write` (all VMA write locks are unlocked
>+  automatically when the mmap write lock is released). To take a VMA write lock
>+  you **must** have already acquired an :c:func:`!mmap_write_lock`.
>+* **rmap locks** - When trying to access VMAs through the reverse mapping via a
>+  :c:struct:`!struct address_space *` or :c:struct:`!struct anon_vma *` object
>+  (each obtainable from a folio), VMAs must be stabilised via
>+  :c:func:`!anon_vma_[try]lock_read` or :c:func:`!anon_vma_[try]lock_write` for
>+  anonymous memory and :c:func:`!i_mmap_[try]lock_read` or
>+  :c:func:`!i_mmap_[try]lock_write` for file-backed memory. We refer to these
>+  locks as the reverse mapping locks, or 'rmap locks' for brevity.
>+
>+We discuss page table locks separately in the dedicated section below.
>+
>+The first thing **any** of these locks achieve is to **stabilise** the VMA
>+within the MM tree. That is, guaranteeing that the VMA object will not be
>+deleted from under you nor modified (except for some specific exceptions
>+describe below).
>+
>+Stabilising a VMA also keeps the address space described by it around.
>+
>+Using address space locks
>+-------------------------
>+
>+If you want to **read** VMA metadata fields or just keep the VMA stable, you
>+must do one of the following:
>+
>+* Obtain an mmap read lock at the MM granularity via :c:func:`!mmap_read_lock` (or a
>+  suitable variant), unlocking it with a matching :c:func:`!mmap_read_unlock` when
>+  you're done with the VMA, *or*
>+* Try to obtain a VMA read lock via :c:func:`!lock_vma_under_rcu`. This tries to
>+  acquire the lock atomically so might fail, in which case fall-back logic is
>+  required to instead obtain an mmap read lock if this returns :c:macro:`!NULL`,
>+  *or*
>+* Acquire an rmap lock before traversing the locked interval tree (whether
>+  anonymous or file-backed) to obtain the required VMA.
>+
>+If you want to **write** VMA metadata fields, then things vary depending on the
>+field (we explore each VMA field in detail below). For the majority you must:
>+
>+* Obtain an mmap write lock at the MM granularity via :c:func:`!mmap_write_lock` (or a
>+  suitable variant), unlocking it with a matching :c:func:`!mmap_write_unlock` when
>+  you're done with the VMA, *and*
>+* Obtain a VMA write lock via :c:func:`!vma_start_write` for each VMA you wish to
>+  modify, which will be released automatically when :c:func:`!mmap_write_unlock` is
>+  called.
>+* If you want to be able to write to **any** field, you must also hide the VMA
>+  from the reverse mapping by obtaining an **rmap write lock**.
>+
>+VMA locks are special in that you must obtain an mmap **write** lock **first**
>+in order to obtain a VMA **write** lock. A VMA **read** lock however can be
>+obtained under an RCU lock alone.
>+
>+.. note:: The primary users of VMA read locks are page fault handlers, which
>+	  means that without a VMA write lock, page faults will run concurrent with
>+	  whatever you are doing.
>+
>+Examining all valid lock states:
>+
>+.. table::
>+
>+   ========= ======== ========= ======= ===== =========== ==========
>+   mmap lock VMA lock rmap lock Stable? Read? Write most? Write all?
>+   ========= ======== ========= ======= ===== =========== ==========
>+   \-        \-       \-        N       N     N           N
>+   \-        R        \-        Y       Y     N           N
>+   \-        \-       R/W       Y       Y     N           N
>+   R/W       \-/R     \-/R/W    Y       Y     N           N
>+   W         W        \-/R      Y       Y     Y           N
>+   W         W        W         Y       Y     Y           Y
>+   ========= ======== ========= ======= ===== =========== ==========
>+
>+.. warning:: While it's possible to obtain a VMA lock while holding an mmap read lock,
>+	     attempting to do the reverse is invalid as it can result in deadlock - if
>+	     another task already holds an mmap write lock and attempts to acquire a VMA
>+	     write lock, that will deadlock on the VMA read lock.
>+
>+All of these locks behave as read/write semaphores in practice, so you can
>+obtain either a read or a write lock for both.
>+
>+.. note:: Generally speaking, a read/write semaphore is a class of lock which
>+	  permits concurrent readers. However a write lock can only be obtained
>+	  once all readers have left the critical region (and pending readers
>+	  made to wait).
>+
>+	  This renders read locks on a read/write semaphore concurrent with other
>+	  readers and write locks exclusive against all others holding the semaphore.
>+
>+VMA fields
>+^^^^^^^^^^
>+
>+We can subdivide :c:struct:`!struct vm_area_struct` fields by their purpose, which makes it
>+easier to explore their locking characteristics:
>+
>+.. note:: We exclude VMA lock-specific fields here to avoid confusion, as these
>+	  are in effect an internal implementation detail.
>+
>+.. table:: Virtual layout fields
>+
>+   ===================== ======================================== ===========
>+   Field                 Description                              Write lock
>+   ===================== ======================================== ===========
>+   :c:member:`!vm_start` Inclusive start virtual address of range mmap write,
>+                         VMA describes.                           VMA write,
>+                                                                  rmap write.
>+   :c:member:`!vm_end`   Exclusive end virtual address of range   mmap write,
>+                         VMA describes.                           VMA write,
>+                                                                  rmap write.
>+   :c:member:`!vm_pgoff` Describes the page offset into the file, mmap write,
>+                         the original page offset within the      VMA write,
>+                         virtual address space (prior to any      rmap write.
>+                         :c:func:`!mremap`), or PFN if a PFN map
>+                         and the architecture does not support
>+                         :c:macro:`!CONFIG_ARCH_HAS_PTE_SPECIAL`.
>+   ===================== ======================================== ===========
>+
>+These fields describe the size, start and end of the VMA, and as such cannot be
>+modified without first being hidden from the reverse mapping since these fields
>+are used to locate VMAs within the reverse mapping interval trees.
>+
>+.. table:: Core fields
>+
>+   ============================ ======================================== =========================
>+   Field                        Description                              Write lock
>+   ============================ ======================================== =========================
>+   :c:member:`!vm_mm`           Containing mm_struct.                    None - written once on
>+                                                                         initial map.
>+   :c:member:`!vm_page_prot`    Architecture-specific page table         mmap write, VMA write.
>+                                protection bits determined from VMA
>+                                flags
>+   :c:member:`!vm_flags`        Read-only access to VMA flags describing N/A
>+                                attributes of the VMA, in union with
>+                                private writable
>+				:c:member:`!__vm_flags`.
>+   :c:member:`!__vm_flags`      Private, writable access to VMA flags    mmap write, VMA write.
>+                                field, updated by
>+                                :c:func:`!vm_flags_*` functions.
>+   :c:member:`!vm_file`         If the VMA is file-backed, points to a   None - written once on
>+                                struct file object describing the        initial map.
>+                                underlying file, if anonymous then
>+				:c:macro:`!NULL`.
>+   :c:member:`!vm_ops`          If the VMA is file-backed, then either   None - Written once on
>+                                the driver or file-system provides a     initial map by
>+                                :c:struct:`!struct vm_operations_struct` :c:func:`!f_ops->mmap()`.
>+				object describing callbacks to be
>+                                invoked on VMA lifetime events.
>+   :c:member:`!vm_private_data` A :c:member:`!void *` field for          Handled by driver.
>+                                driver-specific metadata.
>+   ============================ ======================================== =========================
>+
>+These are the core fields which describe the MM the VMA belongs to and its attributes.
>+
>+.. table:: Config-specific fields
>+
>+   ================================= ===================== ======================================== ===============
>+   Field                             Configuration option  Description                              Write lock
>+   ================================= ===================== ======================================== ===============
>+   :c:member:`!anon_name`            CONFIG_ANON_VMA_NAME  A field for storing a                    mmap write,
>+                                                           :c:struct:`!struct anon_vma_name`        VMA write.
>+                                                           object providing a name for anonymous
>+                                                           mappings, or :c:macro:`!NULL` if none
>+							   is set or the VMA is file-backed.
>+   :c:member:`!swap_readahead_info`  CONFIG_SWAP           Metadata used by the swap mechanism      mmap read.
>+                                                           to perform readahead.
>+   :c:member:`!vm_policy`            CONFIG_NUMA           :c:type:`!mempolicy` object which        mmap write,
>+                                                           describes the NUMA behaviour of the      VMA write.
>+							   VMA.
>+   :c:member:`!numab_state`          CONFIG_NUMA_BALANCING :c:type:`!vma_numab_state` object which  mmap read.
>+                                                           describes the current state of
>+                                                           NUMA balancing in relation to this VMA.
>+                                                           Updated under mmap read lock by
>+							   :c:func:`!task_numa_work`.
>+   :c:member:`!vm_userfaultfd_ctx`   CONFIG_USERFAULTFD    Userfaultfd context wrapper object of    mmap write,
>+                                                           type :c:type:`!vm_userfaultfd_ctx`,      VMA write.
>+                                                           either of zero size if userfaultfd is
>+                                                           disabled, or containing a pointer
>+                                                           to an underlying
>+							   :c:type:`!userfaultfd_ctx` object which
>+                                                           describes userfaultfd metadata.
>+   ================================= ===================== ======================================== ===============
>+
>+These fields are present or not depending on whether the relevant kernel
>+configuration option is set.
>+
>+.. table:: Reverse mapping fields
>+
>+   =================================== ========================================= ================
>+   Field                               Description                               Write lock
>+   =================================== ========================================= ================
>+   :c:member:`!shared.rb`              A red/black tree node used, if the        mmap write,
>+                                       mapping is file-backed, to place the VMA  VMA write,
>+                                       in the                                    i_mmap write.
>+                                       :c:member:`!struct address_space->i_mmap`
>+				       red/black interval tree.
>+   :c:member:`!shared.rb_subtree_last` Metadata used for management of the
>+                                       interval tree if the VMA is file-backed.  mmap write,
>+                                                                                 VMA write,
>+                                                                                 i_mmap write.
>+   :c:member:`!anon_vma_chain`         List of links to forked/CoW’d anon_vma    mmap read,
>+                                       objects.                                  anon_vma write.
>+   :c:member:`!anon_vma`               :c:type:`!anon_vma` object used by        mmap read,
>+                                       anonymous folios mapped exclusively to    page_table_lock.
>+				       this VMA.
>+   =================================== ========================================= ================
>+
>+These fields are used both to place the VMA within the reverse mapping and, for
>+anonymous mappings, to access both related :c:struct:`!struct anon_vma` objects
>+and the :c:struct:`!struct anon_vma` in which folios mapped exclusively to this
>+VMA should reside.
>+
>+Page tables
>+-----------
>+
>+We won't speak exhaustively on the subject but broadly speaking, page tables map
>+virtual addresses to physical ones through a series of page tables, each of
>+which contain entries with physical addresses for the next page table level
>+(along with flags), and at the leaf level the physical addresses of the
>+underlying physical data pages (with offsets into these pages provided by the
>+virtual address itself).
>+
>+In Linux these are divided into five levels - PGD, P4D, PUD, PMD and PTE. Huge
>+pages might eliminate one or two of these levels, but when this is the case we
>+typically refer to the leaf level as the PTE level regardless.
>+
>+.. note:: In instances where the architecture supports fewer page table levels
>+   than five, the kernel cleverly 'folds' page table levels, that is, skips
>+   them within the logic. Regardless, we can act as if there were always five.
>+
>+There are three key operations typically performed on page tables:
>+
>+1. **Installing** page table mappings - whether creating a new mapping or
>+   modifying an existing one.
>+2. **Zapping/unmapping** page tables - This is what the kernel calls clearing page
>+   table mappings at the leaf level only, whilst leaving all page tables in
>+   place. This is a very common operation in the kernel performed on file
>+   truncation, the :c:macro:`!MADV_DONTNEED` operation via :c:func:`!madvise`,
>+   and others. This is performed by a number of functions including
>+   :c:func:`!unmap_mapping_range`, :c:func:`!unmap_mapping_pages` and reverse
>+   mapping logic.
>+3. **Freeing** page tables - When finally the kernel removes page tables from a
>+   userland process (typically via :c:func:`!free_pgtables`) extreme care must
>+   be taken to ensure this is done safely, as this logic finally frees all page
>+   tables in the specified range, taking no care whatsoever with existing
>+   mappings (it assumes the caller has both zapped the range and prevented any
>+   further faults within it).
>+
>+For most kernel developers, cases 1 and 3 are transparent memory management
>+implementation details that are handled behind the scenes for you (we explore
>+these details below in the implementation section).
>+
>+When **zapping** ranges, this can be done holding any one of the locks described
>+in the terminology section above - that is the mmap lock, the VMA lock or either
>+of the reverse mapping locks.
>+
>+That is - as long as you keep the relevant VMA **stable**, you are good to go
>+ahead and zap memory in that VMA's range.
>+
>+.. warning:: When **freeing** page tables, it must not be possible for VMAs
>+	     containing the ranges those page tables map to be accessible via
>+	     the reverse mapping.
>+
>+	     The :c:func:`!free_pgtables` function removes the relevant VMAs
>+	     from the reverse mappings, but no other VMAs can be permitted to be
>+	     accessible and span the specified range.
>+
>+Lock ordering
>+-------------
>+
>+As we have multiple locks across the kernel which may or may not be taken at the
>+same time as explicit mm or VMA locks, we have to be wary of lock inversion, and
>+the **order** in which locks are acquired and released becomes very important.
>+
>+.. note:: Lock inversion occurs when two threads need to acquire multiple locks,
>+   but in doing so inadvertently cause a mutual deadlock.
>+
>+   For example, consider thread 1 which holds lock A and tries to acquire lock B,
>+   while thread 2 holds lock B and tries to acquire lock A.
>+
>+   Both threads are now deadlocked on each other. However, had they attempted to
>+   acquire locks in the same order, one would have waited for the other to
>+   complete its work and no deadlock would have occurred.
>+
>+The opening comment in `mm/rmap.c` describes in detail the required ordering of
>+locks within memory management code:
>+
>+.. code-block::
>+
>+  inode->i_rwsem	(while writing or truncating, not reading or faulting)
>+    mm->mmap_lock
>+      mapping->invalidate_lock (in filemap_fault)
>+        folio_lock
>+          hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
>+            vma_start_write
>+              mapping->i_mmap_rwsem
>+                anon_vma->rwsem
>+                  mm->page_table_lock or pte_lock
>+                    swap_lock (in swap_duplicate, swap_info_get)
>+                      mmlist_lock (in mmput, drain_mmlist and others)
>+                      mapping->private_lock (in block_dirty_folio)
>+                          i_pages lock (widely used)
>+                            lruvec->lru_lock (in folio_lruvec_lock_irq)
>+                      inode->i_lock (in set_page_dirty's __mark_inode_dirty)
>+                      bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
>+                        sb_lock (within inode_lock in fs/fs-writeback.c)
>+                        i_pages lock (widely used, in set_page_dirty,
>+                                  in arch-dependent flush_dcache_mmap_lock,
>+                                  within bdi.wb->list_lock in __sync_single_inode)
>+
>+Please check the current state of this comment which may have changed since the
>+time of writing of this document.
>+
>+------------------------------
>+Locking Implementation Details
>+------------------------------
>+
>+Page table locking details
>+--------------------------
>+
>+In addition to the locks described in the terminology section above, we have
>+additional locks dedicated to page tables:
>+
>+* **Higher level page table locks** - Higher level page tables, that is PGD, P4D
>+  and PUD, each make use of the process address space granularity
>+  :c:member:`!mm->page_table_lock` lock when modified.
>+
>+* **Fine-grained page table locks** - PMDs and PTEs each have fine-grained locks
>+  either kept within the folios describing the page tables or allocated
>+  separately and pointed at by the folios if :c:macro:`!ALLOC_SPLIT_PTLOCKS` is
>+  set. The PMD spin lock is obtained via :c:func:`!pmd_lock`, however PTEs are
>+  mapped into higher memory (if a 32-bit system) and carefully locked via
>+  :c:func:`!pte_offset_map_lock`.
>+
>+These locks represent the minimum required to interact with each page table
>+level, but there are further requirements.
>+
>+Locking rules
>+^^^^^^^^^^^^^
>+
>+We establish basic locking rules when interacting with page tables:
>+
>+* When changing a page table entry the page table lock for that page table
>+  **must** be held.
>+* Reads from and writes to page table entries must be appropriately atomic. See
>+  the section on atomicity below.
>+* Populating previously empty entries requires that the mmap or VMA locks are
>+  held, doing so with only rmap locks would risk a race with unmapping logic
>+  invoking :c:func:`!unmap_vmas`, so is forbidden.
>+* As mentioned above, zapping can be performed while simply keeping the VMA
>+  stable, that is holding any one of the mmap, VMA or rmap locks.
>+* Special care is required for PTEs, as on 32-bit architectures these must be
>+  mapped into high memory and additionally, careful consideration must be
>+  applied to racing with THP, migration or other concurrent kernel operations
>+  that might steal the entire PTE table from under us. All this is handled by
>+  :c:func:`!pte_offset_map_lock`.
>+
>+There are additional rules applicable when moving page tables, which we discuss
>+in the section on this topic below.
>+
>+.. note:: Interestingly, :c:func:`!pte_offset_map_lock` also maintains an RCU
>+          read lock over the mapping (and therefore combined mapping and
>+          locking) operation.
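>+
Putting these rules together, a typical leaf-level walk looks something like the following kernel-style sketch (illustrative pseudocode only; the surrounding PMD-level walk and full error handling are elided):

```c
/*
 * Sketch: examine each PTE mapping [addr, end) within one PMD, under
 * the fine-grained PTE lock. pte_offset_map_lock() maps the PTE table
 * (into highmem on 32-bit), takes an RCU read lock, and validates that
 * the PMD entry has not changed from under us.
 */
spinlock_t *ptl;
pte_t *start_pte, *pte;

start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
if (!pte)
        return -EAGAIN; /* e.g. THP collapse stole the PTE table; retry */

for (; addr < end; pte++, addr += PAGE_SIZE) {
        pte_t entry = ptep_get(pte); /* single, atomic read of the entry */

        if (pte_none(entry))
                continue;
        /* ... operate on the entry while the PTE lock is held ... */
}

pte_unmap_unlock(start_pte, ptl); /* drops lock, unmap and RCU read lock */
```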
>+
>+Atomicity
>+^^^^^^^^^
>+
>+Page table entries must always be retrieved once and only once before being
>+interacted with, as we are operating concurrently with other operations and the
>+hardware.
>+
>+Regardless of page table locks, the MMU hardware will update accessed and dirty
>+bits (and in some architectures, perhaps more), and kernel functionality like
>+GUP-fast locklessly traverses page tables, so we cannot safely assume that page
>+table locks give us exclusive access.
>+
>+If we hold page table locks and are reading page table entries, then we need
>+only ensure that the compiler does not rearrange our loads. This is achieved via
>+:c:func:`!pXXp_get` functions - :c:func:`!pgdp_get`, :c:func:`!p4dp_get`,
>+:c:func:`!pudp_get`, :c:func:`!pmdp_get`, and :c:func:`!ptep_get`.
>+
>+Each of these uses :c:func:`!READ_ONCE` to guarantee that the compiler reads
>+the page table entry only once.
>+
>+However, if we wish to manipulate an existing page table entry and care about
>+the previously stored data, we must go further and use a hardware atomic
>+operation as, for example, in :c:func:`!ptep_get_and_clear`.
>+
>+Equally, operations that do not rely on the page table locks, such as GUP-fast
>+(for instance see :c:func:`!gup_fast` and its various page table level handlers
>+like :c:func:`!gup_fast_pte_range`), must very carefully interact with page
>+table entries, using functions such as :c:func:`!ptep_get_lockless` and
>+equivalent for higher page table levels.
>+
>+Writes to page table entries must also be appropriately atomic, as established
>+by :c:func:`!set_pXX` functions - :c:func:`!set_pgd`, :c:func:`!set_p4d`,
>+:c:func:`!set_pud`, :c:func:`!set_pmd`, and :c:func:`!set_pte`.
>+
>+
>+Page table installation
>+^^^^^^^^^^^^^^^^^^^^^^^
>+
>+When allocating a P4D, PUD or PMD and setting the relevant entry in the above
>+PGD, P4D or PUD, the :c:member:`!mm->page_table_lock` must be held. This is
>+acquired in :c:func:`!__p4d_alloc`, :c:func:`!__pud_alloc` and
>+:c:func:`!__pmd_alloc` respectively.
>+
>+.. note:: :c:func:`!__pmd_alloc` actually invokes :c:func:`!pud_lock` and
>+   :c:func:`!pud_lockptr` in turn, however at the time of writing it ultimately
>+   references the :c:member:`!mm->page_table_lock`.
>+
>+Allocating a PTE will either use the :c:member:`!mm->page_table_lock` or, if
>+:c:macro:`!USE_SPLIT_PMD_PTLOCKS` is defined, use a lock embedded in the PMD
>+physical page metadata in the form of a :c:struct:`!struct ptdesc`, acquired by
>+:c:func:`!pmd_ptdesc` called from :c:func:`!pmd_lock` and ultimately
>+:c:func:`!__pte_alloc`.
>+
>+Finally, modifying the contents of the PTE has special treatment, as this is a
>+lock that we must acquire whenever we want stable and exclusive access to
>+entries pointing to data pages within a PTE, especially when we wish to modify
>+them.
>+
>+This is performed via :c:func:`!pte_offset_map_lock` which carefully checks to
>+ensure that the PTE hasn't changed from under us, ultimately invoking
>+:c:func:`!pte_lockptr` to obtain a spin lock at PTE granularity contained within
>+the :c:struct:`!struct ptdesc` associated with the physical PTE page. The lock
>+must be released via :c:func:`!pte_unmap_unlock`.
>+
>+.. note:: There are some variants on this, such as
>+   :c:func:`!pte_offset_map_rw_nolock` when we know we hold the PTE stable but
>+   for brevity we do not explore this.  See the comment for
>+   :c:func:`!__pte_offset_map_lock` for more details.
>+
>+When modifying data in ranges we typically only wish to allocate higher page
>+tables as necessary, using these locks to avoid races or overwriting anything,
>+and to set/clear data at the PTE level as required (for instance when page
>+faulting or zapping).
>+
>+Page table freeing
>+^^^^^^^^^^^^^^^^^^
>+
>+Tearing down page tables themselves is something that requires significant
>+care. There must be no way that page tables designated for removal can be
>+traversed or referenced by concurrent tasks.
>+
>+It is insufficient to simply hold an mmap write lock and VMA lock (which will
>+prevent racing faults and rmap operations), as a file-backed mapping can be
>+truncated under the :c:struct:`!struct address_space` i_mmap_lock alone.
>+
>+As a result, no VMA which can be accessed via the reverse mapping (either
>+anon_vma or the :c:member:`!struct address_space->i_mmap` interval tree) can
>+have its page tables torn down.
>+
>+The operation is typically performed via :c:func:`!free_pgtables`, which assumes
>+either the mmap write lock has been taken (as specified by its
>+:c:member:`!mm_wr_locked` parameter), or that the VMA is already unreachable.
>+
>+It carefully removes the VMA from all reverse mappings; however, it is important
>+that no new reverse mappings overlap these and that no route remains by which
>+addresses within the range whose page tables are being torn down can be
>+accessed.
>+
>+As a result of these careful conditions, note that page table entries are
>+cleared without page table locks, as it is assumed that all of these precautions
>+have already been taken (in the :c:func:`!pgd_clear`, :c:func:`!p4d_clear`,
>+:c:func:`!pud_clear`, and :c:func:`!pmd_clear` functions - note that at this
>+stage it is assumed that PTE entries have been zapped).
>+
>+.. note:: It is possible for leaf page tables to be torn down, independent of
>+          the page tables above it, as is done by
>+          :c:func:`!retract_page_tables`, which is performed under the i_mmap
>+          read lock, PMD, and PTE page table locks, without this level of care.
>+
>+Page table moving
>+^^^^^^^^^^^^^^^^^
>+
>+Some functions manipulate page table levels above PMD (that is PUD, P4D and PGD
>+page tables). Most notable of these is :c:func:`!mremap`, which is capable of
>+moving higher level page tables.
>+
>+In these instances, it is either required that **all** locks are taken, that is
>+the mmap lock, the VMA lock and the relevant rmap lock, or that the mmap lock
>+and VMA locks are taken and some other measure is taken to avoid rmap races (see
>+the comment in :c:func:`!move_ptes` in the :c:func:`!mremap` implementation for
>+details of how this is handled in this instance).
>+
>+You can observe that in the :c:func:`!mremap` implementation in the functions
>+:c:func:`!take_rmap_locks` and :c:func:`!drop_rmap_locks` which perform the rmap
>+side of lock acquisition, invoked ultimately by :c:func:`!move_page_tables`.
>+
>+VMA lock internals
>+------------------
>+
>+VMA read locking is entirely optimistic - if the lock is contended or a
>+competing write has started, then we do not obtain a read lock.
>+
>+The :c:func:`!lock_vma_under_rcu` function first calls :c:func:`!rcu_read_lock`
>+to ensure that the VMA is acquired in an RCU critical section, then attempts to
>+VMA lock it via :c:func:`!vma_start_read`, before releasing the RCU lock via
>+:c:func:`!rcu_read_unlock`.
>+
>+VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
>+their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
>+via :c:func:`!vma_end_read`.
>+
>+VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances where a
>+VMA is about to be modified; unlike :c:func:`!vma_start_read`, the lock is always
>+acquired. An mmap write lock **must** be held for the duration of the VMA write
>+lock, and releasing or downgrading the mmap write lock also releases the VMA
>+write lock, so there is no :c:func:`!vma_end_write` function.
>+
>+Note that a semaphore write lock is not held across a VMA lock. Rather, a
>+sequence number is used for serialisation, and the write semaphore is only
>+acquired at the point of write lock to update this.
>+
>+This ensures the semantics we require - VMA write locks provide exclusive write
>+access to the VMA.
>+
>+The VMA lock mechanism is designed to be a lightweight means of avoiding the use
>+of the heavily contended mmap lock. It is implemented using a combination of a
>+read/write semaphore and sequence numbers belonging to the containing
>+:c:struct:`!struct mm_struct` and the VMA.
>+
>+Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
>+operation, i.e. it tries to acquire a read lock but returns false if it is
>+unable to do so. At the end of the read operation, :c:func:`!vma_end_read` is
>+called to release the VMA read lock. This can be done under RCU alone.
>+
>+Writing requires the mmap to be write-locked and the VMA lock to be acquired via
>+:c:func:`!vma_start_write`; however, the write lock is released by the termination
>+or downgrade of the mmap write lock, so no :c:func:`!vma_end_write` is required.
>+
>+All this is achieved by the use of per-mm and per-VMA sequence counts, which are
>+used in order to reduce complexity, especially for operations which write-lock
>+multiple VMAs at once.
>+

Hi, Lorenzo

I got a question more than a comment when trying to understand the mechanism.

I am thinking about the benefit of PER_VMA_LOCK.

For write, we always need to grab mmap_lock with or without PER_VMA_LOCK.
For read, it seems we don't need to grab mmap_lock with PER_VMA_LOCK. So read
operations benefit the most from PER_VMA_LOCK, right?

>+If the mm sequence count, :c:member:`!mm->mm_lock_seq`, is equal to the VMA
>+sequence count :c:member:`!vma->vm_lock_seq`, then the VMA is write-locked. If
>+they differ, then it is not.
>+
>+Each time an mmap write lock is acquired in :c:func:`!mmap_write_lock`,
>+:c:func:`!mmap_write_lock_nested` or :c:func:`!mmap_write_lock_killable`, the
>+:c:member:`!mm->mm_lock_seq` sequence number is incremented via
>+:c:func:`!mm_lock_seqcount_begin`.
>+
>+Each time the mmap write lock is released in :c:func:`!mmap_write_unlock` or
>+:c:func:`!mmap_write_downgrade`, :c:func:`!vma_end_write_all` is invoked which
>+also increments :c:member:`!mm->mm_lock_seq` via
>+:c:func:`!mm_lock_seqcount_end`.
>+
>+This way, we ensure regardless of the VMA's sequence number count, that a write
>+lock is not incorrectly indicated (since we increment the sequence counter on
>+acquiring the mmap write lock, which is required in order to obtain a VMA write
>+lock), and that when we release an mmap write lock, we efficiently release
>+**all** VMA write locks contained within the mmap at the same time.
>+
>+The exclusivity of the mmap write lock ensures this is what we want, as there
>+would never be a reason to persist per-VMA write locks across multiple mmap
>+write lock acquisitions.
>+
>+Each time a VMA read lock is acquired, we acquire a read lock on the
>+:c:member:`!vma->vm_lock` read/write semaphore and hold it, while checking that
>+the sequence count of the VMA does not match that of the mm.
>+
>+If it does, the read lock fails. If it does not, we hold the lock, excluding
>+writers, but permitting other readers, who will also obtain this lock under RCU.
>+
>+Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
>+are also RCU safe, so the whole read lock operation is guaranteed to function
>+correctly.
>+
>+On the write side, we acquire a write lock on the :c:member:`!vma->vm_lock`
>+read/write semaphore, before setting the VMA's sequence number under this lock,
>+also simultaneously holding the mmap write lock.
>+
>+This way, if any read locks are in effect, :c:func:`!vma_start_write` will sleep
>+until these are finished and mutual exclusion is achieved.
>+
>+After setting the VMA's sequence number, the lock is released, avoiding
>+complexity with a long-term held write lock.
>+
>+This clever combination of a read/write semaphore and sequence count allows for
>+fast RCU-based per-VMA lock acquisition (especially on page fault, though
>+utilised elsewhere) with minimal complexity around lock ordering.
>+
>+mmap write lock downgrading
>+---------------------------
>+
>+When an mmap write lock is held, one has exclusive access to resources within
>+the mmap (with the usual caveats about requiring VMA write locks to avoid races
>+with tasks holding VMA read locks).
>+
>+It is then possible to **downgrade** from a write lock to a read lock via
>+:c:func:`!mmap_write_downgrade` which, similar to :c:func:`!mmap_write_unlock`,
>+implicitly terminates all VMA write locks via :c:func:`!vma_end_write_all`, but
>+importantly does not relinquish the mmap lock while downgrading, therefore
>+keeping the locked virtual address space stable.
>+
>+An interesting consequence of this is that downgraded locks will be exclusive
>+against any other task possessing a downgraded lock (since they'd have to
>+acquire a write lock first to do so, and the lock now being a read lock prevents
>+this).
>+
>+For clarity, mapping read (R)/downgraded write (D)/write (W) locks against one
>+another showing which locks exclude the others:
>+
>+.. list-table:: Lock exclusivity
>+   :widths: 5 5 5 5
>+   :header-rows: 1
>+   :stub-columns: 1
>+
>+   * -
>+     - R
>+     - D
>+     - W
>+   * - R
>+     - N
>+     - N
>+     - Y
>+   * - D
>+     - N
>+     - Y
>+     - Y
>+   * - W
>+     - Y
>+     - Y
>+     - Y
>+
>+Here a Y indicates the locks in the matching row/column exclude one another, and
>+N indicates that they do not.
>+
>+Stack expansion
>+---------------
>+
>+Stack expansion throws up additional complexities in that we cannot permit there
>+to be racing page faults. As a result, we invoke :c:func:`!vma_start_write` to
>+prevent this in :c:func:`!expand_downwards` or :c:func:`!expand_upwards`.
>--
>2.47.0
Wei Yang Nov. 11, 2024, 7:07 a.m. UTC | #9
On Thu, Nov 07, 2024 at 07:01:37PM +0000, Lorenzo Stoakes wrote:
>
>v1:
>* Removed RFC tag as I think we are iterating towards something workable
>  and there is interest.
>* Cleaned up and sharpened the language, structure and layout. Separated
>  into top-level details and implementation sections as per Alice.
>* Replaced links with rather more readable formatting.
>* Improved valid mmap/VMA lock state table.
>* Put VMA locks section into the process addresses document as per SJ and
>  Mike.
>* Made clear as to read/write operations against VMA object rather than
>  userland memory, as per Mike's suggestion, also that it does not refer to
>  page tables as per Jann.
>* Moved note into main section as per Mike's suggestion.
>* Fixed grammar mistake as per Mike.
>* Converted list-table to table as per Mike.
>* Corrected various typos as per Jann, Suren.
>* Updated reference to page fault arches as per Jann.
>* Corrected mistaken write lock criteria for vm_lock_seq as per Jann.
>* Updated vm_pgoff description to reference CONFIG_ARCH_HAS_PTE_SPECIAL as
>  per Jann.
>* Updated write lock to mmap read for vma->numab_state as per Jann.
>* Clarified that the write lock is on the mmap and VMA lock at VMA
>  granularity earlier in description as per Suren.
>* Added explicit note at top of VMA lock section to explicitly highlight
>  VMA lock semantics as per Suren.
>* Updated required locking for vma lock fields to N/A to avoid confusion as
>  per Suren.
>* Corrected description of mmap_downgrade() as per Suren.
>* Added a note on gate VMAs as per Jann.
>* Explained that taking mmap read lock under VMA lock is a bad idea due to
>  deadlock as per Jann.
>* Discussed atomicity in page table operations as per Jann.
>* Adapted the well thought out page table locking rules as provided by Jann.
>* Added a comment about pte mapping maintaining an RCU read lock.
>* Added clarification on moving page tables as informed by Jann's comments
>  (though it turns out mremap() doesn't necessarily hold all locks if it
>  can resolve races other ways :)
>* Added Jann's diagram showing lock exclusivity characteristics.
>
>RFC:
>https://lore.kernel.org/all/20241101185033.131880-1-lorenzo.stoakes@oracle.com/
>
> Documentation/mm/process_addrs.rst | 678 +++++++++++++++++++++++++++++
> 1 file changed, 678 insertions(+)
>
>diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
>index e8618fbc62c9..a01a7bcf39ff 100644
>--- a/Documentation/mm/process_addrs.rst
>+++ b/Documentation/mm/process_addrs.rst
>@@ -3,3 +3,681 @@
> =================
> Process Addresses
> =================
>+
>+.. toctree::
>+   :maxdepth: 3
>+
>+
>+Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
>+'VMA's of type :c:struct:`!struct vm_area_struct`.
>+
>+Each VMA describes a virtually contiguous memory range with identical
>+attributes, each of which described by a :c:struct:`!struct vm_area_struct`
>+object. Userland access outside of VMAs is invalid except in the case where an
>+adjacent stack VMA could be extended to contain the accessed address.
>+
>+All VMAs are contained within one and only one virtual address space, described
>+by a :c:struct:`!struct mm_struct` object which is referenced by all tasks (that is,
>+threads) which share the virtual address space. We refer to this as the
>+:c:struct:`!mm`.
>+
>+Each mm object contains a maple tree data structure which describes all VMAs
>+within the virtual address space.
>+
>+.. note:: An exception to this is the 'gate' VMA which is provided for
>+	  architectures which use :c:struct:`!vsyscall` and is a global static
>+	  object which does not belong to any specific mm.
>+
>+-------
>+Locking
>+-------
>+
>+The kernel is designed to be highly scalable against concurrent read operations
>+on VMA **metadata**, so a complicated set of locks is required to ensure memory
>+corruption does not occur.
>+
>+.. note:: Locking VMAs for their metadata does not have any impact on the memory
>+	  they describe or the page tables that map them.
>+
>+Terminology
>+-----------
>+
>+* **mmap locks** - Each MM has a read/write semaphore `mmap_lock` which locks at
>+  a process address space granularity which can be acquired via
>+  :c:func:`!mmap_read_lock`, :c:func:`!mmap_write_lock` and variants.
>+* **VMA locks** - The VMA lock is at VMA granularity (of course) which behaves
>+  as a read/write semaphore in practice. A VMA read lock is obtained via
>+  :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a
>+  write lock via :c:func:`!vma_start_write` (all VMA write locks are unlocked
>+  automatically when the mmap write lock is released). To take a VMA write lock
>+  you **must** have already acquired an :c:func:`!mmap_write_lock`.
>+* **rmap locks** - When trying to access VMAs through the reverse mapping via a
>+  :c:struct:`!struct address_space *` or :c:struct:`!struct anon_vma *` object
>+  (each obtainable from a folio), VMAs must be stabilised via
>+  :c:func:`!anon_vma_[try]lock_read` or :c:func:`!anon_vma_[try]lock_write` for
>+  anonymous memory and :c:func:`!i_mmap_[try]lock_read` or
>+  :c:func:`!i_mmap_[try]lock_write` for file-backed memory. We refer to these
>+  locks as the reverse mapping locks, or 'rmap locks' for brevity.
>+
>+We discuss page table locks separately in the dedicated section below.
>+
>+The first thing **any** of these locks achieve is to **stabilise** the VMA
>+within the MM tree. That is, guaranteeing that the VMA object will not be
>+deleted from under you nor modified (except for some specific exceptions
>+described below).
>+
>+Stabilising a VMA also keeps the address space described by it around.
>+
>+Using address space locks
>+-------------------------
>+
>+If you want to **read** VMA metadata fields or just keep the VMA stable, you
>+must do one of the following:
>+
>+* Obtain an mmap read lock at the MM granularity via :c:func:`!mmap_read_lock` (or a
>+  suitable variant), unlocking it with a matching :c:func:`!mmap_read_unlock` when
>+  you're done with the VMA, *or*
>+* Try to obtain a VMA read lock via :c:func:`!lock_vma_under_rcu`. This tries to
>+  acquire the lock atomically so might fail, in which case fall-back logic is
>+  required to instead obtain an mmap read lock if this returns :c:macro:`!NULL`,
>+  *or*
>+* Acquire an rmap lock before traversing the locked interval tree (whether
>+  anonymous or file-backed) to obtain the required VMA.
>+
>+If you want to **write** VMA metadata fields, then things vary depending on the
>+field (we explore each VMA field in detail below). For the majority you must:
>+
>+* Obtain an mmap write lock at the MM granularity via :c:func:`!mmap_write_lock` (or a
>+  suitable variant), unlocking it with a matching :c:func:`!mmap_write_unlock` when
>+  you're done with the VMA, *and*
>+* Obtain a VMA write lock via :c:func:`!vma_start_write` for each VMA you wish to
>+  modify, which will be released automatically when :c:func:`!mmap_write_unlock` is
>+  called.
>+* If you want to be able to write to **any** field, you must also hide the VMA
>+  from the reverse mapping by obtaining an **rmap write lock**.

Here I am a little confused.

I guess it wants to say mmap write lock and VMA write lock should be obtained,
while rmap write lock is optional for write operation. And this is what
illustrated in below section "VMA fields".

But here says 'write to "any" field'. This seems not what it shows below.

Or maybe I misunderstand this part.

>+
>+VMA locks are special in that you must obtain an mmap **write** lock **first**
>+in order to obtain a VMA **write** lock. A VMA **read** lock however can be
>+obtained under an RCU lock alone.
>+
>+.. note:: The primary users of VMA read locks are page fault handlers, which
>+	  means that without a VMA write lock, page faults will run concurrently with
>+	  whatever you are doing.
>+
>+Examining all valid lock states:
>+
>+.. table::
>+
>+   ========= ======== ========= ======= ===== =========== ==========
>+   mmap lock VMA lock rmap lock Stable? Read? Write most? Write all?
>+   ========= ======== ========= ======= ===== =========== ==========
>+   \-        \-       \-        N       N     N           N
>+   \-        R        \-        Y       Y     N           N
>+   \-        \-       R/W       Y       Y     N           N
>+   R/W       \-/R     \-/R/W    Y       Y     N           N
>+   W         W        \-/R      Y       Y     Y           N
>+   W         W        W         Y       Y     Y           Y
>+   ========= ======== ========= ======= ===== =========== ==========
>+
>+.. warning:: While it's possible to obtain a VMA read lock while holding an mmap read lock,
>+	     attempting to do the reverse is invalid as it can result in deadlock - if
>+	     another task already holds an mmap write lock and attempts to acquire a VMA
>+	     write lock, that will deadlock on the VMA read lock.
>+
>+All of these locks behave as read/write semaphores in practice, so you can
>+obtain either a read or a write lock for both.
>+
>+.. note:: Generally speaking, a read/write semaphore is a class of lock which
>+	  permits concurrent readers. However a write lock can only be obtained
>+	  once all readers have left the critical region (and pending readers
>+	  made to wait).
>+
>+	  This renders read locks on a read/write semaphore concurrent with other
>+	  readers and write locks exclusive against all others holding the semaphore.
>+
>+VMA fields
>+^^^^^^^^^^
>+
>+We can subdivide :c:struct:`!struct vm_area_struct` fields by their purpose, which makes it
>+easier to explore their locking characteristics:
>+
>+.. note:: We exclude VMA lock-specific fields here to avoid confusion, as these
>+	  are in effect an internal implementation detail.
>+
>+.. table:: Virtual layout fields
>+
>+   ===================== ======================================== ===========
>+   Field                 Description                              Write lock
>+   ===================== ======================================== ===========
>+   :c:member:`!vm_start` Inclusive start virtual address of range mmap write,
>+                         VMA describes.                           VMA write,
>+                                                                  rmap write.
>+   :c:member:`!vm_end`   Exclusive end virtual address of range   mmap write,
>+                         VMA describes.                           VMA write,
>+                                                                  rmap write.
>+   :c:member:`!vm_pgoff` Describes the page offset into the file, rmap write.

                                                                    ^
I am afraid this one is a typo?

>+                         the original page offset within the      mmap write,
>+                         virtual address space (prior to any      rmap write.
>+                         :c:func:`!mremap`), or PFN if a PFN map
>+                         and the architecture does not support
>+                         :c:macro:`!CONFIG_ARCH_HAS_PTE_SPECIAL`.
>+   ===================== ======================================== ===========
>+
>+These fields describe the size, start and end of the VMA, and as such cannot be
>+modified without first being hidden from the reverse mapping since these fields
>+are used to locate VMAs within the reverse mapping interval trees.
>+
>+.. table:: Core fields
>+
>+   ============================ ======================================== =========================
>+   Field                        Description                              Write lock
>+   ============================ ======================================== =========================
>+   :c:member:`!vm_mm`           Containing mm_struct.                    None - written once on
>+                                                                         initial map.
>+   :c:member:`!vm_page_prot`    Architecture-specific page table         mmap write, VMA write.
>+                                protection bits determined from VMA
>+                                flags
>+   :c:member:`!vm_flags`        Read-only access to VMA flags describing N/A
>+                                attributes of the VMA, in union with
>+                                private writable
>+				:c:member:`!__vm_flags`.
>+   :c:member:`!__vm_flags`      Private, writable access to VMA flags    mmap write, VMA write.
>+                                field, updated by
>+                                :c:func:`!vm_flags_*` functions.
>+   :c:member:`!vm_file`         If the VMA is file-backed, points to a   None - written once on
>+                                struct file object describing the        initial map.
>+                                underlying file, if anonymous then
>+				:c:macro:`!NULL`.
>+   :c:member:`!vm_ops`          If the VMA is file-backed, then either   None - Written once on
>+                                the driver or file-system provides a     initial map by
>+                                :c:struct:`!struct vm_operations_struct` :c:func:`!f_ops->mmap()`.
>+				object describing callbacks to be
>+                                invoked on VMA lifetime events.
>+   :c:member:`!vm_private_data` A :c:member:`!void *` field for          Handled by driver.
>+                                driver-specific metadata.
>+   ============================ ======================================== =========================
>+
>+These are the core fields which describe the MM the VMA belongs to and its attributes.
>+
>+.. table:: Config-specific fields
>+
>+   ================================= ===================== ======================================== ===============
>+   Field                             Configuration option  Description                              Write lock
>+   ================================= ===================== ======================================== ===============
>+   :c:member:`!anon_name`            CONFIG_ANON_VMA_NAME  A field for storing a                    mmap write,
>+                                                           :c:struct:`!struct anon_vma_name`        VMA write.
>+                                                           object providing a name for anonymous
>+                                                           mappings, or :c:macro:`!NULL` if none
>+							   is set or the VMA is file-backed.
>+   :c:member:`!swap_readahead_info`  CONFIG_SWAP           Metadata used by the swap mechanism      mmap read.
>+                                                           to perform readahead.
>+   :c:member:`!vm_policy`            CONFIG_NUMA           :c:type:`!mempolicy` object which        mmap write,
>+                                                           describes the NUMA behaviour of the      VMA write.
>+							   VMA.
>+   :c:member:`!numab_state`          CONFIG_NUMA_BALANCING :c:type:`!vma_numab_state` object which  mmap read.
>+                                                           describes the current state of
>+                                                           NUMA balancing in relation to this VMA.
>+                                                           Updated under mmap read lock by
>+							   :c:func:`!task_numa_work`.
>+   :c:member:`!vm_userfaultfd_ctx`   CONFIG_USERFAULTFD    Userfaultfd context wrapper object of    mmap write,
>+                                                           type :c:type:`!vm_userfaultfd_ctx`,      VMA write.
>+                                                           either of zero size if userfaultfd is
>+                                                           disabled, or containing a pointer
>+                                                           to an underlying
>+							   :c:type:`!userfaultfd_ctx` object which
>+                                                           describes userfaultfd metadata.
>+   ================================= ===================== ======================================== ===============
>+
>+These fields are present or not depending on whether the relevant kernel
>+configuration option is set.
>+
>+.. table:: Reverse mapping fields
>+
>+   =================================== ========================================= ================
>+   Field                               Description                               Write lock
>+   =================================== ========================================= ================
>+   :c:member:`!shared.rb`              A red/black tree node used, if the        mmap write,
>+                                       mapping is file-backed, to place the VMA  VMA write,
>+                                       in the                                    i_mmap write.
>+                                       :c:member:`!struct address_space->i_mmap`
>+				       red/black interval tree.
>+   :c:member:`!shared.rb_subtree_last` Metadata used for management of the       mmap write,
>+                                       interval tree if the VMA is file-backed.  VMA write,
>+                                                                                 i_mmap write.
>+   :c:member:`!anon_vma_chain`         List of links to forked/CoW’d anon_vma    mmap read,
>+                                       objects.                                  anon_vma write.
>+   :c:member:`!anon_vma`               :c:type:`!anon_vma` object used by        mmap read,
>+                                       anonymous folios mapped exclusively to    page_table_lock.
>+				       this VMA.
>+   =================================== ========================================= ================
>+
>+These fields are used both to place the VMA within the reverse mapping and, for
>+anonymous mappings, to access both related :c:struct:`!struct anon_vma` objects
>+and the :c:struct:`!struct anon_vma` in which folios mapped exclusively to this
>+VMA should reside.
>+
>+Page tables
>+-----------
>+
>+We won't speak exhaustively on the subject but broadly speaking, page tables map
>+virtual addresses to physical ones through a series of page tables, each of
>+which contain entries with physical addresses for the next page table level
>+(along with flags), and at the leaf level the physical addresses of the
>+underlying physical data pages (with offsets into these pages provided by the
>+virtual address itself).
>+
>+In Linux these are divided into five levels - PGD, P4D, PUD, PMD and PTE. Huge
>+pages might eliminate one or two of these levels, but when this is the case we
>+typically refer to the leaf level as the PTE level regardless.
>+
>+.. note:: In instances where the architecture supports fewer page table levels
>+   than five, the kernel cleverly 'folds' page table levels, that is, skips them
>+   within the logic. Regardless, we can act as if there were always five.
>+
>+There are three key operations typically performed on page tables:
>+
>+1. **Installing** page table mappings - whether creating a new mapping or
>+   modifying an existing one.
>+2. **Zapping/unmapping** page tables - This is what the kernel calls clearing page
>+   table mappings at the leaf level only, whilst leaving all page tables in
>+   place. This is a very common operation in the kernel performed on file
>+   truncation, the :c:macro:`!MADV_DONTNEED` operation via :c:func:`!madvise`,
>+   and others. This is performed by a number of functions including
>+   :c:func:`!unmap_mapping_range`, :c:func:`!unmap_mapping_pages` and reverse
>+   mapping logic.
>+3. **Freeing** page tables - When finally the kernel removes page tables from a
>+   userland process (typically via :c:func:`!free_pgtables`) extreme care must
>+   be taken to ensure this is done safely, as this logic finally frees all page
>+   tables in the specified range, taking no care whatsoever with existing
>+   mappings (it assumes the caller has both zapped the range and prevented any
>+   further faults within it).
>+
>+For most kernel developers, cases 1 and 3 are transparent memory management
>+implementation details that are handled behind the scenes for you (we explore
>+these details below in the implementation section).
>+
>+When **zapping** ranges, this can be done holding any one of the locks described
>+in the terminology section above - that is the mmap lock, the VMA lock or either
>+of the reverse mapping locks.
>+
>+That is - as long as you keep the relevant VMA **stable**, you are good to go
>+ahead and zap memory in that VMA's range.
>+
>+.. warning:: When **freeing** page tables, it must not be possible for VMAs
>+	     containing the ranges those page tables map to be accessible via
>+	     the reverse mapping.
>+
>+	     The :c:func:`!free_pgtables` function removes the relevant VMAs
>+	     from the reverse mappings, but no other VMAs can be permitted to be
>+	     accessible and span the specified range.
>+
>+Lock ordering
>+-------------
>+
>+As we have multiple locks across the kernel which may or may not be taken at the
>+same time as explicit mm or VMA locks, we have to be wary of lock inversion, and
>+the **order** in which locks are acquired and released becomes very important.
>+
>+.. note:: Lock inversion occurs when two threads need to acquire multiple locks,
>+   but in doing so inadvertently cause a mutual deadlock.
>+
>+   For example, consider thread 1 which holds lock A and tries to acquire lock B,
>+   while thread 2 holds lock B and tries to acquire lock A.
>+
>+   Both threads are now deadlocked on each other. However, had they attempted to
>+   acquire locks in the same order, one would have waited for the other to
>+   complete its work and no deadlock would have occurred.
>+
>+The opening comment in `mm/rmap.c` describes in detail the required ordering of
>+locks within memory management code:
>+
>+.. code-block::
>+
>+  inode->i_rwsem	(while writing or truncating, not reading or faulting)
>+    mm->mmap_lock
>+      mapping->invalidate_lock (in filemap_fault)
>+        folio_lock
>+          hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
>+            vma_start_write
>+              mapping->i_mmap_rwsem
>+                anon_vma->rwsem
>+                  mm->page_table_lock or pte_lock
>+                    swap_lock (in swap_duplicate, swap_info_get)
>+                      mmlist_lock (in mmput, drain_mmlist and others)
>+                      mapping->private_lock (in block_dirty_folio)
>+                          i_pages lock (widely used)
>+                            lruvec->lru_lock (in folio_lruvec_lock_irq)
>+                      inode->i_lock (in set_page_dirty's __mark_inode_dirty)
>+                      bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
>+                        sb_lock (within inode_lock in fs/fs-writeback.c)
>+                        i_pages lock (widely used, in set_page_dirty,
>+                                  in arch-dependent flush_dcache_mmap_lock,
>+                                  within bdi.wb->list_lock in __sync_single_inode)
>+
>+Please check the current state of this comment which may have changed since the
>+time of writing of this document.
>+
>+------------------------------
>+Locking Implementation Details
>+------------------------------
>+
>+Page table locking details
>+--------------------------
>+
>+In addition to the locks described in the terminology section above, we have
>+additional locks dedicated to page tables:
>+
>+* **Higher level page table locks** - Higher level page tables, that is PGD, P4D
>+  and PUD each make use of the process address space granularity
>+  :c:member:`!mm->page_table_lock` lock when modified.
>+
>+* **Fine-grained page table locks** - PMDs and PTEs each have fine-grained locks
>+  either kept within the folios describing the page tables or allocated
>+  separately and pointed at by the folios if :c:macro:`!ALLOC_SPLIT_PTLOCKS` is
>+  set. The PMD spin lock is obtained via :c:func:`!pmd_lock`, however PTEs are
>+  mapped into higher memory (if a 32-bit system) and carefully locked via
>+  :c:func:`!pte_offset_map_lock`.
>+
>+These locks represent the minimum required to interact with each page table
>+level, but there are further requirements.
>+
>+Locking rules
>+^^^^^^^^^^^^^
>+
>+We establish basic locking rules when interacting with page tables:
>+
>+* When changing a page table entry the page table lock for that page table
>+  **must** be held.
>+* Reads from and writes to page table entries must be appropriately atomic. See
>+  the section on atomicity below.
>+* Populating previously empty entries requires that the mmap or VMA locks are
>+  held, doing so with only rmap locks would risk a race with unmapping logic
>+  invoking :c:func:`!unmap_vmas`, so is forbidden.
>+* As mentioned above, zapping can be performed while simply keeping the VMA
>+  stable, that is holding any one of the mmap, VMA or rmap locks.
>+* Special care is required for PTEs, as on 32-bit architectures these must be
>+  mapped into high memory and additionally, careful consideration must be
>+  applied to racing with THP, migration or other concurrent kernel operations
>+  that might steal the entire PTE table from under us. All this is handled by
>+  :c:func:`!pte_offset_map_lock`.
>+
>+There are additional rules applicable when moving page tables, which we discuss
>+in the section on this topic below.
>+
>+.. note:: Interestingly, :c:func:`!pte_offset_map_lock` also maintains an RCU
>+          read lock over the mapping (and therefore combined mapping and
>+          locking) operation.
>+
>+Atomicity
>+^^^^^^^^^
>+
>+Page table entries must always be retrieved once and only once before being
>+interacted with, as we are operating concurrently with other operations and the
>+hardware.
>+
>+Regardless of page table locks, the MMU hardware will update accessed and dirty
>+bits (and in some architectures, perhaps more), and kernel functionality like
>+GUP-fast locklessly traverses page tables, so we cannot safely assume that page
>+table locks give us exclusive access.
>+
>+If we hold page table locks and are reading page table entries, then we need
>+only ensure that the compiler does not rearrange our loads. This is achieved via
>+:c:func:`!pXXp_get` functions - :c:func:`!pgdp_get`, :c:func:`!p4dp_get`,
>+:c:func:`!pudp_get`, :c:func:`!pmdp_get`, and :c:func:`!ptep_get`.
>+
>+Each of these uses :c:func:`!READ_ONCE` to guarantee that the compiler reads
>+the page table entry only once.
>+
>+However, if we wish to manipulate an existing page table entry and care about
>+the previously stored data, we must go further and use a hardware atomic
>+operation as, for example, in :c:func:`!ptep_get_and_clear`.
>+
>+Equally, operations that do not rely on the page table locks, such as GUP-fast
>+(for instance see :c:func:`!gup_fast` and its various page table level handlers
>+like :c:func:`!gup_fast_pte_range`), must very carefully interact with page
>+table entries, using functions such as :c:func:`!ptep_get_lockless` and
>+equivalent for higher page table levels.
>+
>+Writes to page table entries must also be appropriately atomic, as established
>+by :c:func:`!set_pXX` functions - :c:func:`!set_pgd`, :c:func:`!set_p4d`,
>+:c:func:`!set_pud`, :c:func:`!set_pmd`, and :c:func:`!set_pte`.
>+
>+
>+Page table installation
>+^^^^^^^^^^^^^^^^^^^^^^^
>+
>+When allocating a P4D, PUD or PMD and setting the relevant entry in the above
>+PGD, P4D or PUD, the :c:member:`!mm->page_table_lock` must be held. This is
>+acquired in :c:func:`!__p4d_alloc`, :c:func:`!__pud_alloc` and
>+:c:func:`!__pmd_alloc` respectively.
>+
>+.. note:: :c:func:`!__pmd_alloc` actually invokes :c:func:`!pud_lock` and
>+   :c:func:`!pud_lockptr` in turn, however at the time of writing it ultimately
>+   references the :c:member:`!mm->page_table_lock`.
>+
>+Allocating a PTE will either use the :c:member:`!mm->page_table_lock` or, if
>+:c:macro:`!USE_SPLIT_PMD_PTLOCKS` is defined, uses a lock embedded in the PMD
>+physical page metadata in the form of a :c:struct:`!struct ptdesc`, acquired by
>+:c:func:`!pmd_ptdesc` called from :c:func:`!pmd_lock` and ultimately
>+:c:func:`!__pte_alloc`.
>+
>+Finally, modifying the contents of the PTE requires special treatment, as the
>+PTE page table lock must be acquired whenever we want stable and exclusive
>+access to entries pointing to data pages within a PTE, especially when we wish
>+to modify them.
>+
>+This is performed via :c:func:`!pte_offset_map_lock` which carefully checks to
>+ensure that the PTE hasn't changed from under us, ultimately invoking
>+:c:func:`!pte_lockptr` to obtain a spin lock at PTE granularity contained within
>+the :c:struct:`!struct ptdesc` associated with the physical PTE page. The lock
>+must be released via :c:func:`!pte_unmap_unlock`.
>+
>+.. note:: There are some variants on this, such as
>+   :c:func:`!pte_offset_map_rw_nolock` when we know we hold the PTE stable but
>+   for brevity we do not explore this.  See the comment for
>+   :c:func:`!__pte_offset_map_lock` for more details.
>+
>+When modifying data in ranges we typically only wish to allocate higher page
>+tables as necessary, using these locks to avoid races or overwriting anything,
>+and set/clear data at the PTE level as required (for instance when page faulting
>+or zapping).
>+
>+Page table freeing
>+^^^^^^^^^^^^^^^^^^
>+
>+Tearing down page tables themselves is something that requires significant
>+care. There must be no way that page tables designated for removal can be
>+traversed or referenced by concurrent tasks.
>+
>+It is insufficient to simply hold an mmap write lock and VMA lock (which will
>+prevent racing faults and rmap operations), as a file-backed mapping can be
>+truncated under the :c:struct:`!struct address_space` i_mmap_lock alone.
>+
>+As a result, no VMA which can be accessed via the reverse mapping (either
>+anon_vma or the :c:member:`!struct address_space->i_mmap` interval tree) can
>+have its page tables torn down.
>+
>+The operation is typically performed via :c:func:`!free_pgtables`, which assumes
>+either the mmap write lock has been taken (as specified by its
>+:c:member:`!mm_wr_locked` parameter), or that the VMA is already unreachable.
>+
>+It carefully removes the VMA from all reverse mappings; however, it's important
>+that no new mappings overlap these and that no route remains to permit access to
>+addresses within the range whose page tables are being torn down.
>+
>+As a result of these careful conditions, note that page table entries are
>+cleared without page table locks, as it is assumed that all of these precautions
>+have already been taken (in the :c:func:`!pgd_clear`, :c:func:`!p4d_clear`,
>+:c:func:`!pud_clear`, and :c:func:`!pmd_clear` functions - note that at this
>+stage it is assumed that PTE entries have been zapped).
>+
>+.. note:: It is possible for leaf page tables to be torn down, independent of
>+          the page tables above it, as is done by
>+          :c:func:`!retract_page_tables`, which is performed under the i_mmap
>+          read lock, PMD, and PTE page table locks, without this level of care.
>+
>+Page table moving
>+^^^^^^^^^^^^^^^^^
>+
>+Some functions manipulate page table levels above PMD (that is PUD, P4D and PGD
>+page tables). Most notable of these is :c:func:`!mremap`, which is capable of
>+moving higher level page tables.
>+
>+In these instances, it is either required that **all** locks are taken, that is
>+the mmap lock, the VMA lock and the relevant rmap lock, or that the mmap lock
>+and VMA locks are taken and some other measure is taken to avoid rmap races (see
>+the comment in :c:func:`!move_ptes` in the :c:func:`!mremap` implementation for
>+details of how this is handled in this instance).
>+
>+You can observe that in the :c:func:`!mremap` implementation in the functions
>+:c:func:`!take_rmap_locks` and :c:func:`!drop_rmap_locks` which perform the rmap
>+side of lock acquisition, invoked ultimately by :c:func:`!move_page_tables`.
>+
>+VMA lock internals
>+------------------
>+
>+This kind of locking is entirely optimistic - if the lock is contended or a
>+competing write has started, then we do not obtain a read lock.
>+
>+The :c:func:`!lock_vma_under_rcu` function first calls :c:func:`!rcu_read_lock`
>+to ensure that the VMA is acquired in an RCU critical section, then attempts to
>+VMA lock it via :c:func:`!vma_start_read`, before releasing the RCU lock via
>+:c:func:`!rcu_read_unlock`.
>+
>+VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
>+their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
>+via :c:func:`!vma_end_read`.
>+
>+VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances
>+where a VMA is about to be modified; unlike :c:func:`!vma_start_read`, the lock
>+is always acquired. An mmap write lock **must** be held for the duration of the
>+VMA write lock; releasing or downgrading the mmap write lock also releases the
>+VMA write lock, so there is no :c:func:`!vma_end_write` function.
>+
>+Note that a semaphore write lock is not held across a VMA lock. Rather, a
>+sequence number is used for serialisation, and the write semaphore is only
>+acquired at the point of write lock to update this.
>+
>+This ensures the semantics we require - VMA write locks provide exclusive write
>+access to the VMA.
>+
>+The VMA lock mechanism is designed to be a lightweight means of avoiding the use
>+of the heavily contended mmap lock. It is implemented using a combination of a
>+read/write semaphore and sequence numbers belonging to the containing
>+:c:struct:`!struct mm_struct` and the VMA.
>+
>+Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
>+operation, i.e. it tries to acquire a read lock but returns false if it is
>+unable to do so. At the end of the read operation, :c:func:`!vma_end_read` is
>+called to release the VMA read lock. This can be done under RCU alone.
>+
>+Writing requires the mmap to be write-locked and the VMA lock to be acquired
>+via :c:func:`!vma_start_write`; however, the write lock is released by the
>+termination or downgrade of the mmap write lock, so no :c:func:`!vma_end_write`
>+is required.
>+
>+All this is achieved by the use of per-mm and per-VMA sequence counts, which are
>+used in order to reduce complexity, especially for operations which write-lock
>+multiple VMAs at once.
>+
>+If the mm sequence count, :c:member:`!mm->mm_lock_seq` is equal to the VMA
>+sequence count :c:member:`!vma->vm_lock_seq` then the VMA is write-locked. If
>+they differ, then they are not.
>+
>+Each time an mmap write lock is acquired in :c:func:`!mmap_write_lock`,
>+:c:func:`!mmap_write_lock_nested` or :c:func:`!mmap_write_lock_killable`, the
>+:c:member:`!mm->mm_lock_seq` sequence number is incremented via
>+:c:func:`!mm_lock_seqcount_begin`.
>+
>+Each time the mmap write lock is released in :c:func:`!mmap_write_unlock` or
>+:c:func:`!mmap_write_downgrade`, :c:func:`!vma_end_write_all` is invoked which
>+also increments :c:member:`!mm->mm_lock_seq` via
>+:c:func:`!mm_lock_seqcount_end`.
>+
>+This way, we ensure that, regardless of the VMA's sequence number, a write
>+lock is not incorrectly indicated (since we increment the sequence counter on
>+acquiring the mmap write lock, which is required in order to obtain a VMA write
>+lock), and that when we release an mmap write lock, we efficiently release
>+**all** VMA write locks contained within the mmap at the same time.
>+
>+The exclusivity of the mmap write lock ensures this is what we want, as there
>+would never be a reason to persist per-VMA write locks across multiple mmap
>+write lock acquisitions.
>+
>+Each time a VMA read lock is acquired, we acquire a read lock on the
>+:c:member:`!vma->vm_lock` read/write semaphore and hold it, while checking that
>+the sequence count of the VMA does not match that of the mm.
>+
>+If it does, the read lock fails. If it does not, we hold the lock, excluding
>+writers, but permitting other readers, who will also obtain this lock under RCU.
>+
>+Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
>+are also RCU safe, so the whole read lock operation is guaranteed to function
>+correctly.
>+
>+On the write side, we acquire a write lock on the :c:member:`!vma->vm_lock`
>+read/write semaphore, before setting the VMA's sequence number under this lock,
>+also simultaneously holding the mmap write lock.
>+
>+This way, if any read locks are in effect, :c:func:`!vma_start_write` will sleep
>+until these are finished and mutual exclusion is achieved.
>+
>+After setting the VMA's sequence number, the lock is released, avoiding
>+complexity with a long-term held write lock.
>+
>+This clever combination of a read/write semaphore and sequence count allows for
>+fast RCU-based per-VMA lock acquisition (especially on page fault, though
>+utilised elsewhere) with minimal complexity around lock ordering.
>+
>+mmap write lock downgrading
>+---------------------------
>+
>+When an mmap write lock is held, one has exclusive access to resources within
>+the mmap (with the usual caveats about requiring VMA write locks to avoid races
>+with tasks holding VMA read locks).
>+
>+It is then possible to **downgrade** from a write lock to a read lock via
>+:c:func:`!mmap_write_downgrade` which, similar to :c:func:`!mmap_write_unlock`,
>+implicitly terminates all VMA write locks via :c:func:`!vma_end_write_all`, but
>+importantly does not relinquish the mmap lock while downgrading, therefore
>+keeping the locked virtual address space stable.
>+
>+An interesting consequence of this is that downgraded locks will be exclusive
>+against any other task possessing a downgraded lock (since they'd have to
>+acquire a write lock first to do so, and the lock now being a read lock prevents
>+this).
>+
>+For clarity, mapping read (R)/downgraded write (D)/write (W) locks against one
>+another showing which locks exclude the others:
>+
>+.. list-table:: Lock exclusivity
>+   :widths: 5 5 5 5
>+   :header-rows: 1
>+   :stub-columns: 1
>+
>+   * -
>+     - R
>+     - D
>+     - W
>+   * - R
>+     - N
>+     - N
>+     - Y
>+   * - D
>+     - N
>+     - Y
>+     - Y
>+   * - W
>+     - Y
>+     - Y
>+     - Y
>+
>+Here a Y indicates the locks in the matching row/column exclude one another, and
>+N indicates that they do not.
>+
>+Stack expansion
>+---------------
>+
>+Stack expansion throws up additional complexities in that we cannot permit
>+there to be racing page faults; as a result, we invoke :c:func:`!vma_start_write`
>+to prevent this in :c:func:`!expand_downwards` or :c:func:`!expand_upwards`.
>--
>2.47.0
Lorenzo Stoakes Nov. 11, 2024, 10:31 a.m. UTC | #10
On Mon, Nov 11, 2024 at 02:42:44AM +0000, Wei Yang wrote:

> >+Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
> >+operation, i.e. it tries to acquire a read lock but returns false if it is
> >+unable to do so. At the end of the read operation, :c:func:`!vma_end_read` is
> >+called to release the VMA read lock. This can be done under RCU alone.
> >+
> >+Writing requires the mmap to be write-locked and the VMA lock to be acquired via
> >+:c:func:`!vma_start_write`, however the write lock is released by the termination or
> >+downgrade of the mmap write lock so no :c:func:`!vma_end_write` is required.
> >+
> >+All this is achieved by the use of per-mm and per-VMA sequence counts, which are
> >+used in order to reduce complexity, especially for operations which write-lock
> >+multiple VMAs at once.
> >+
>
> Hi, Lorenzo
>
> I got a question more than a comment when trying to understand the mechanism.
>
> I am thinking about the benefit of PER_VMA_LOCK.
>
> For write, we always need to grab mmap_lock w/o PER_VMA_LOCK.
> For read, seems we don't need to grab mmap_lock with PER_VMA_LOCK. So read
> operation benefit the most with PER_VMA_LOCK, right?

Yes this is the point of it, VMA fields are read-mostly and the most natural use
for this is page faulting.

The whole intent is to allow for the read lock under RCU.
Lorenzo Stoakes Nov. 11, 2024, 10:39 a.m. UTC | #11
Wei - a side note, when reviewing huuuuge patches like this _please_ remove
irrelevant lines, it's really hard for me to respond and easy for me to
miss stuff scrolling through >600 lines. Thanks!

Also - you're reviewing an out of date series, the new version has changed
a lot, see [0].

[0]:https://lore.kernel.org/all/20241108135708.48567-1-lorenzo.stoakes@oracle.com/


On Mon, Nov 11, 2024 at 07:07:19AM +0000, Wei Yang wrote:
> On Thu, Nov 07, 2024 at 07:01:37PM +0000, Lorenzo Stoakes wrote:
> >+If you want to **write** VMA metadata fields, then things vary depending on the
> >+field (we explore each VMA field in detail below). For the majority you must:
> >+
> >+* Obtain an mmap write lock at the MM granularity via :c:func:`!mmap_write_lock` (or a
> >+  suitable variant), unlocking it with a matching :c:func:`!mmap_write_unlock` when
> >+  you're done with the VMA, *and*
> >+* Obtain a VMA write lock via :c:func:`!vma_start_write` for each VMA you wish to
> >+  modify, which will be released automatically when :c:func:`!mmap_write_unlock` is
> >+  called.
> >+* If you want to be able to write to **any** field, you must also hide the VMA
> >+  from the reverse mapping by obtaining an **rmap write lock**.
>
> Here I am a little confused.
>
> I guess it wants to say mmap write lock and VMA write lock should be obtained,
> while rmap write lock is optional for write operation. And this is what
> illustrated in below section "VMA fields".
>
> But here says 'write to "any" field'. This seems not what it shows below.
>
> Or maybe I misunderstand this part.

It opens with 'for the majority' as in, for the _majority_ you need the
mmap write, VMA write lock.

To be able to adjust _all_ you need the rmap write lock too.

I have to trade off with 'summarising' how you get a write lock vs. the
specifics, I mean I literally give a table immediately after this
explicitly listing what is required.

If I gave just that, it'd be confusing and people wouldn't maybe realise
that _mostly_ mmap write, VMA write is enough.

If I don't mention the rmap thing then people might be confused as to why.

Given I give the correct requirements in the table I don't think this is a
problem.

[snip]

> >+.. table:: Virtual layout fields
> >+
> >+   ===================== ======================================== ===========
> >+   Field                 Description                              Write lock
> >+   ===================== ======================================== ===========
> >+   :c:member:`!vm_start` Inclusive start virtual address of range mmap write,
> >+                         VMA describes.                           VMA write,
> >+                                                                  rmap write.
> >+   :c:member:`!vm_end`   Exclusive end virtual address of range   mmap write,
> >+                         VMA describes.                           VMA write,
> >+                                                                  rmap write.
> >+   :c:member:`!vm_pgoff` Describes the page offset into the file, rmap write.
>
>                                                                     ^
> I am afraid this one is a typo?

This is fixed in v2.

[snip]
Bagas Sanjaya Nov. 12, 2024, 1:12 a.m. UTC | #12
On Fri, Nov 08, 2024 at 12:34:23PM +0000, Lorenzo Stoakes wrote:
> On Fri, Nov 08, 2024 at 12:29:15PM +0000, Lorenzo Stoakes wrote:
> > On Fri, Nov 08, 2024 at 07:25:38PM +0700, Bagas Sanjaya wrote:
> > > On Thu, Nov 07, 2024 at 07:01:37PM +0000, Lorenzo Stoakes wrote:
> > > > +.. note:: In instances where the architecture supports fewer page tables than
> > > > +   five the kernel cleverly 'folds' page table levels, that is skips them within
> > > > +   the logic, regardless we can act as if there were always five.
> > >
> > > What are being skipped if e.g. we only have 3 or 4 page tables?
> > >
> > > Confused...
> >
> > Page table levels, see [0].
> >
> > Typically achieved through stubbing functions out, etc. So in the code you
> > actually write and _conceptually_ there are five levels, only in the final
> > compile you might have things like p4d_present(), p4d_clear() etc. squashed by
> > the compiler into nothing.
> 
> I have updated this note to be a little clearer and to explicitly state
> this. Thanks!

Thanks for clarifying.
diff mbox series

Patch

diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
index e8618fbc62c9..a01a7bcf39ff 100644
--- a/Documentation/mm/process_addrs.rst
+++ b/Documentation/mm/process_addrs.rst
@@ -3,3 +3,681 @@ 
 =================
 Process Addresses
 =================
+
+.. toctree::
+   :maxdepth: 3
+
+
+Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
+'VMA's of type :c:struct:`!struct vm_area_struct`.
+
+Each VMA describes a virtually contiguous memory range with identical
+attributes, each described by a :c:struct:`!struct vm_area_struct` object.
+Userland access outside of VMAs is invalid except in the case where an
+adjacent stack VMA could be extended to contain the accessed address.
+
+All VMAs are contained within one and only one virtual address space, described
+by a :c:struct:`!struct mm_struct` object which is referenced by all tasks (that is,
+threads) which share the virtual address space. We refer to this as the
+:c:struct:`!mm`.
+
+Each mm object contains a maple tree data structure which describes all VMAs
+within the virtual address space.
+
+.. note:: An exception to this is the 'gate' VMA which is provided for
+	  architectures which use :c:struct:`!vsyscall` and is a global static
+	  object which does not belong to any specific mm.
+
+-------
+Locking
+-------
+
+The kernel is designed to be highly scalable against concurrent read operations
+on VMA **metadata** so a complicated set of locks are required to ensure memory
+corruption does not occur.
+
+.. note:: Locking VMAs for their metadata does not have any impact on the memory
+	  they describe or the page tables that map them.
+
+Terminology
+-----------
+
+* **mmap locks** - Each MM has a read/write semaphore :c:member:`!mmap_lock` which locks at
+  a process address space granularity which can be acquired via
+  :c:func:`!mmap_read_lock`, :c:func:`!mmap_write_lock` and variants.
+* **VMA locks** - The VMA lock is at VMA granularity (of course) which behaves
+  as a read/write semaphore in practice. A VMA read lock is obtained via
+  :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a
+  write lock via :c:func:`!vma_start_write` (all VMA write locks are unlocked
+  automatically when the mmap write lock is released). To take a VMA write lock
+  you **must** have already acquired an :c:func:`!mmap_write_lock`.
+* **rmap locks** - When trying to access VMAs through the reverse mapping via a
+  :c:struct:`!struct address_space *` or :c:struct:`!struct anon_vma *` object
+  (each obtainable from a folio), VMAs must be stabilised via
+  :c:func:`!anon_vma_[try]lock_read` or :c:func:`!anon_vma_[try]lock_write` for
+  anonymous memory and :c:func:`!i_mmap_[try]lock_read` or
+  :c:func:`!i_mmap_[try]lock_write` for file-backed memory. We refer to these
+  locks as the reverse mapping locks, or 'rmap locks' for brevity.
+
+We discuss page table locks separately in the dedicated section below.
+
+The first thing **any** of these locks achieve is to **stabilise** the VMA
+within the MM tree. That is, guaranteeing that the VMA object will not be
+deleted from under you nor modified (except for some specific exceptions
+described below).
+
+Stabilising a VMA also keeps the address space described by it around.
+
+Using address space locks
+-------------------------
+
+If you want to **read** VMA metadata fields or just keep the VMA stable, you
+must do one of the following:
+
+* Obtain an mmap read lock at the MM granularity via :c:func:`!mmap_read_lock` (or a
+  suitable variant), unlocking it with a matching :c:func:`!mmap_read_unlock` when
+  you're done with the VMA, *or*
+* Try to obtain a VMA read lock via :c:func:`!lock_vma_under_rcu`. This tries to
+  acquire the lock atomically so might fail, in which case fall-back logic is
+  required to instead obtain an mmap read lock if this returns :c:macro:`!NULL`,
+  *or*
+* Acquire an rmap lock before traversing the locked interval tree (whether
+  anonymous or file-backed) to obtain the required VMA.
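Putting the first two options together, the typical pattern used by page fault handlers looks roughly like the following sketch (kernel code shown for illustration only - error handling is elided and :c:func:`!do_something_with` is a hypothetical placeholder for the actual work):

```c
struct vm_area_struct *vma = lock_vma_under_rcu(mm, address);

if (vma) {
        /* Fast path - VMA read lock held, mmap lock not taken. */
        do_something_with(vma);              /* hypothetical */
        vma_end_read(vma);
} else {
        /* Fall back to stabilising the VMA via the mmap read lock. */
        mmap_read_lock(mm);
        vma = vma_lookup(mm, address);
        if (vma)
                do_something_with(vma);      /* hypothetical */
        mmap_read_unlock(mm);
}
```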
+
+If you want to **write** VMA metadata fields, then things vary depending on the
+field (we explore each VMA field in detail below). For the majority you must:
+
+* Obtain an mmap write lock at the MM granularity via :c:func:`!mmap_write_lock` (or a
+  suitable variant), unlocking it with a matching :c:func:`!mmap_write_unlock` when
+  you're done with the VMA, *and*
+* Obtain a VMA write lock via :c:func:`!vma_start_write` for each VMA you wish to
+  modify, which will be released automatically when :c:func:`!mmap_write_unlock` is
+  called.
+* If you want to be able to write to **any** field, you must also hide the VMA
+  from the reverse mapping by obtaining an **rmap write lock**.
+
+VMA locks are special in that you must obtain an mmap **write** lock **first**
+in order to obtain a VMA **write** lock. A VMA **read** lock however can be
+obtained under an RCU lock alone.
+
+.. note:: The primary users of VMA read locks are page fault handlers, which
+	  means that without a VMA write lock, page faults will run concurrently with
+	  whatever you are doing.
+
+Examining all valid lock states:
+
+.. table::
+
+   ========= ======== ========= ======= ===== =========== ==========
+   mmap lock VMA lock rmap lock Stable? Read? Write most? Write all?
+   ========= ======== ========= ======= ===== =========== ==========
+   \-        \-       \-        N       N     N           N
+   \-        R        \-        Y       Y     N           N
+   \-        \-       R/W       Y       Y     N           N
+   R/W       \-/R     \-/R/W    Y       Y     N           N
+   W         W        \-/R      Y       Y     Y           N
+   W         W        W         Y       Y     Y           Y
+   ========= ======== ========= ======= ===== =========== ==========
+
+.. warning:: While it's possible to obtain a VMA lock while holding an mmap read lock,
+	     attempting to do the reverse is invalid as it can result in deadlock - if
+	     another task already holds an mmap write lock and attempts to acquire a VMA
+	     write lock, it will deadlock on the VMA read lock.
+
+All of these locks behave as read/write semaphores in practice, so you can
+obtain either a read or a write lock for both.
+
+.. note:: Generally speaking, a read/write semaphore is a class of lock which
+	  permits concurrent readers. However a write lock can only be obtained
+	  once all readers have left the critical region (and pending readers
+	  made to wait).
+
+	  This renders read locks on a read/write semaphore concurrent with other
+	  readers and write locks exclusive against all others holding the semaphore.
+
+VMA fields
+^^^^^^^^^^
+
+We can subdivide :c:struct:`!struct vm_area_struct` fields by their purpose, which makes it
+easier to explore their locking characteristics:
+
+.. note:: We exclude VMA lock-specific fields here to avoid confusion, as these
+	  are in effect an internal implementation detail.
+
+.. table:: Virtual layout fields
+
+   ===================== ======================================== ===========
+   Field                 Description                              Write lock
+   ===================== ======================================== ===========
+   :c:member:`!vm_start` Inclusive start virtual address of range mmap write,
+                         VMA describes.                           VMA write,
+                                                                  rmap write.
+   :c:member:`!vm_end`   Exclusive end virtual address of range   mmap write,
+                         VMA describes.                           VMA write,
+                                                                  rmap write.
+   :c:member:`!vm_pgoff` Describes the page offset into the file, mmap write,
+                         the original page offset within the      VMA write,
+                         virtual address space (prior to any      rmap write.
+                         :c:func:`!mremap`), or PFN if a PFN map
+                         and the architecture does not support
+                         :c:macro:`!CONFIG_ARCH_HAS_PTE_SPECIAL`.
+   ===================== ======================================== ===========
+
+These fields describe the size, start and end of the VMA, and as such cannot be
+modified without first being hidden from the reverse mapping since these fields
+are used to locate VMAs within the reverse mapping interval trees.
+
+.. table:: Core fields
+
+   ============================ ======================================== =========================
+   Field                        Description                              Write lock
+   ============================ ======================================== =========================
+   :c:member:`!vm_mm`           Containing mm_struct.                    None - written once on
+                                                                         initial map.
+   :c:member:`!vm_page_prot`    Architecture-specific page table         mmap write, VMA write.
+                                protection bits determined from VMA
+                                flags
+   :c:member:`!vm_flags`        Read-only access to VMA flags describing N/A
+                                attributes of the VMA, in union with
+                                private writable
+				:c:member:`!__vm_flags`.
+   :c:member:`!__vm_flags`      Private, writable access to VMA flags    mmap write, VMA write.
+                                field, updated by
+                                :c:func:`!vm_flags_*` functions.
+   :c:member:`!vm_file`         If the VMA is file-backed, points to a   None - written once on
+                                struct file object describing the        initial map.
+                                underlying file, if anonymous then
+				:c:macro:`!NULL`.
+   :c:member:`!vm_ops`          If the VMA is file-backed, then either   None - Written once on
+                                the driver or file-system provides a     initial map by
+                                :c:struct:`!struct vm_operations_struct` :c:func:`!f_ops->mmap()`.
+				object describing callbacks to be
+                                invoked on VMA lifetime events.
+   :c:member:`!vm_private_data` A :c:member:`!void *` field for          Handled by driver.
+                                driver-specific metadata.
+   ============================ ======================================== =========================
+
+These are the core fields which describe the MM the VMA belongs to and its attributes.
+
+.. table:: Config-specific fields
+
+   ================================= ===================== ======================================== ===============
+   Field                             Configuration option  Description                              Write lock
+   ================================= ===================== ======================================== ===============
+   :c:member:`!anon_name`            CONFIG_ANON_VMA_NAME  A field for storing a                    mmap write,
+                                                           :c:struct:`!struct anon_vma_name`        VMA write.
+                                                           object providing a name for anonymous
+                                                           mappings, or :c:macro:`!NULL` if none
+							   is set or the VMA is file-backed.
+   :c:member:`!swap_readahead_info`  CONFIG_SWAP           Metadata used by the swap mechanism      mmap read.
+                                                           to perform readahead.
+   :c:member:`!vm_policy`            CONFIG_NUMA           :c:type:`!mempolicy` object which        mmap write,
+                                                           describes the NUMA behaviour of the      VMA write.
+							   VMA.
+   :c:member:`!numab_state`          CONFIG_NUMA_BALANCING :c:type:`!vma_numab_state` object which  mmap read.
+                                                           describes the current state of
+                                                           NUMA balancing in relation to this VMA.
+                                                           Updated under mmap read lock by
+							   :c:func:`!task_numa_work`.
+   :c:member:`!vm_userfaultfd_ctx`   CONFIG_USERFAULTFD    Userfaultfd context wrapper object of    mmap write,
+                                                           type :c:type:`!vm_userfaultfd_ctx`,      VMA write.
+                                                           either of zero size if userfaultfd is
+                                                           disabled, or containing a pointer
+                                                           to an underlying
+							   :c:type:`!userfaultfd_ctx` object which
+                                                           describes userfaultfd metadata.
+   ================================= ===================== ======================================== ===============
+
+These fields are present or not depending on whether the relevant kernel
+configuration option is set.
+
+.. table:: Reverse mapping fields
+
+   =================================== ========================================= ================
+   Field                               Description                               Write lock
+   =================================== ========================================= ================
+   :c:member:`!shared.rb`              A red/black tree node used, if the        mmap write,
+                                       mapping is file-backed, to place the VMA  VMA write,
+                                       in the                                    i_mmap write.
+                                       :c:member:`!struct address_space->i_mmap`
+				       red/black interval tree.
+   :c:member:`!shared.rb_subtree_last` Metadata used for management of the
+                                       interval tree if the VMA is file-backed.  mmap write,
+                                                                                 VMA write,
+                                                                                 i_mmap write.
+   :c:member:`!anon_vma_chain`         List of links to forked/CoW’d anon_vma    mmap read,
+                                       objects.                                  anon_vma write.
+   :c:member:`!anon_vma`               :c:type:`!anon_vma` object used by        mmap_read,
+                                       anonymous folios mapped exclusively to    page_table_lock.
+				       this VMA.
+   =================================== ========================================= ================
+
+These fields are used both to place the VMA within the reverse mapping and, for
+anonymous mappings, to access both the related :c:struct:`!struct anon_vma`
+objects and the :c:struct:`!struct anon_vma` in which folios mapped exclusively
+to this VMA should reside.
+
+Page tables
+-----------
+
+We won't speak exhaustively on the subject but broadly speaking, the kernel maps
+virtual addresses to physical ones through a series of page tables, each of
+which contains entries with physical addresses for the next page table level
+(along with flags), and at the leaf level the physical addresses of the
+underlying physical data pages (with offsets into these pages provided by the
+virtual address itself).
+
+In Linux these are divided into five levels - PGD, P4D, PUD, PMD and PTE. Huge
+pages might eliminate one or two of these levels, but when this is the case we
+typically refer to the leaf level as the PTE level regardless.
+
+.. note:: In instances where the architecture supports fewer page table levels
+   than five, the kernel 'folds' the missing levels - that is, it skips them
+   within the logic, typically by stubbing the relevant functions out so the
+   compiler can eliminate them - so we can act as if there were always five.
+
+There are three key operations typically performed on page tables:
+
+1. **Installing** page table mappings - whether creating a new mapping or
+   modifying an existing one.
+2. **Zapping/unmapping** page tables - This is what the kernel calls clearing page
+   table mappings at the leaf level only, whilst leaving all page tables in
+   place. This is a very common operation in the kernel performed on file
+   truncation, the :c:macro:`!MADV_DONTNEED` operation via :c:func:`!madvise`,
+   and others. This is performed by a number of functions including
+   :c:func:`!unmap_mapping_range`, :c:func:`!unmap_mapping_pages` and reverse
+   mapping logic.
+3. **Freeing** page tables - When finally the kernel removes page tables from a
+   userland process (typically via :c:func:`!free_pgtables`) extreme care must
+   be taken to ensure this is done safely, as this logic finally frees all page
+   tables in the specified range, taking no care whatsoever with existing
+   mappings (it assumes the caller has both zapped the range and prevented any
+   further faults within it).
+
+For most kernel developers, cases 1 and 3 are transparent memory management
+implementation details that are handled behind the scenes for you (we explore
+these details below in the implementation section).
+
+When **zapping** ranges, this can be done holding any one of the locks described
+in the terminology section above - that is the mmap lock, the VMA lock or either
+of the reverse mapping locks.
+
+That is - as long as you keep the relevant VMA **stable**, you are good to go
+ahead and zap memory in that VMA's range.
+
+.. warning:: When **freeing** page tables, it must not be possible for VMAs
+	     containing the ranges those page tables map to be accessible via
+	     the reverse mapping.
+
+	     The :c:func:`!free_pgtables` function removes the relevant VMAs
+	     from the reverse mappings, but no other VMAs can be permitted to be
+	     accessible and span the specified range.
+
+Lock ordering
+-------------
+
+As we have multiple locks across the kernel which may or may not be taken at the
+same time as explicit mm or VMA locks, we have to be wary of lock inversion, and
+the **order** in which locks are acquired and released becomes very important.
+
+.. note:: Lock inversion occurs when two threads need to acquire multiple locks,
+   but in doing so inadvertently cause a mutual deadlock.
+
+   For example, consider thread 1 which holds lock A and tries to acquire lock B,
+   while thread 2 holds lock B and tries to acquire lock A.
+
+   Both threads are now deadlocked on each other. However, had they attempted to
+   acquire locks in the same order, one would have waited for the other to
+   complete its work and no deadlock would have occurred.
+
+The opening comment in `mm/rmap.c` describes in detail the required ordering of
+locks within memory management code:
+
+.. code-block::
+
+  inode->i_rwsem	(while writing or truncating, not reading or faulting)
+    mm->mmap_lock
+      mapping->invalidate_lock (in filemap_fault)
+        folio_lock
+          hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
+            vma_start_write
+              mapping->i_mmap_rwsem
+                anon_vma->rwsem
+                  mm->page_table_lock or pte_lock
+                    swap_lock (in swap_duplicate, swap_info_get)
+                      mmlist_lock (in mmput, drain_mmlist and others)
+                      mapping->private_lock (in block_dirty_folio)
+                          i_pages lock (widely used)
+                            lruvec->lru_lock (in folio_lruvec_lock_irq)
+                      inode->i_lock (in set_page_dirty's __mark_inode_dirty)
+                      bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
+                        sb_lock (within inode_lock in fs/fs-writeback.c)
+                        i_pages lock (widely used, in set_page_dirty,
+                                  in arch-dependent flush_dcache_mmap_lock,
+                                  within bdi.wb->list_lock in __sync_single_inode)
+
+Please check the current state of this comment, as it may have changed since
+this document was written.
+
+------------------------------
+Locking Implementation Details
+------------------------------
+
+Page table locking details
+--------------------------
+
+In addition to the locks described in the terminology section above, we have
+additional locks dedicated to page tables:
+
+* **Higher level page table locks** - Higher level page tables, that is the
+  PGD, P4D and PUD, each make use of the process address space granularity
+  :c:member:`!mm->page_table_lock` when modified.
+
+* **Fine-grained page table locks** - PMDs and PTEs each have fine-grained locks
+  either kept within the folios describing the page tables or allocated
+  separately and pointed at by the folios if :c:macro:`!ALLOC_SPLIT_PTLOCKS` is
+  set. The PMD spin lock is obtained via :c:func:`!pmd_lock`; PTE tables,
+  however, may reside in high memory (on 32-bit systems) so must be mapped and
+  carefully locked via :c:func:`!pte_offset_map_lock`.
+
+These locks represent the minimum required to interact with each page table
+level, but there are further requirements.
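+
+As an illustrative sketch (assuming the PMD already exists and eliding error
+handling), modifying a PMD entry under its fine-grained lock looks like:
+
+.. code-block:: c
+
+	spinlock_t *ptl;
+
+	/* Acquire the fine-grained lock protecting this PMD. */
+	ptl = pmd_lock(mm, pmd);
+	/* ... read or modify entries via pmdp_get(), set_pmd() etc. ... */
+	spin_unlock(ptl);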
+
+Locking rules
+^^^^^^^^^^^^^
+
+We establish basic locking rules when interacting with page tables:
+
+* When changing a page table entry the page table lock for that page table
+  **must** be held.
+* Reads from and writes to page table entries must be appropriately atomic. See
+  the section on atomicity below.
+* Populating previously empty entries requires that the mmap or VMA locks are
+  held; doing so with only rmap locks held would risk racing with unmapping
+  logic invoking :c:func:`!unmap_vmas`, so is forbidden.
+* As mentioned above, zapping can be performed while simply keeping the VMA
+  stable, that is holding any one of the mmap, VMA or rmap locks.
+* Special care is required for PTEs as, on 32-bit architectures, these may
+  reside in high memory and must be mapped. Additionally, careful consideration
+  must be given to racing with THP, migration or other concurrent kernel
+  operations that might steal the entire PTE table from under us. All this is
+  handled by :c:func:`!pte_offset_map_lock`.
+
+There are additional rules applicable when moving page tables, which we discuss
+in the section on this topic below.
+
+.. note:: Interestingly, :c:func:`!pte_offset_map_lock` also maintains an RCU
+          read lock over the mapping (and therefore combined mapping and
+          locking) operation.
+
+Atomicity
+^^^^^^^^^
+
+Page table entries must always be retrieved once and only once before being
+interacted with, as we are operating concurrently with other operations and the
+hardware.
+
+Regardless of page table locks, the MMU hardware will update accessed and dirty
+bits (and in some architectures, perhaps more), and kernel functionality like
+GUP-fast locklessly traverses page tables, so we cannot safely assume that page
+table locks give us exclusive access.
+
+If we hold page table locks and are reading page table entries, then we need
+only ensure that the compiler does not rearrange our loads. This is achieved via
+:c:func:`!pXXp_get` functions - :c:func:`!pgdp_get`, :c:func:`!p4dp_get`,
+:c:func:`!pudp_get`, :c:func:`!pmdp_get`, and :c:func:`!ptep_get`.
+
+Each of these uses :c:func:`!READ_ONCE` to guarantee that the compiler reads
+the page table entry only once.
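+
+For example, a correct read fetches the entry exactly once and then operates on
+the local copy (a simplified sketch):
+
+.. code-block:: c
+
+	/* A single READ_ONCE of the live entry... */
+	pte_t entry = ptep_get(ptep);
+
+	/* ...with all subsequent checks made against the local copy. */
+	if (pte_none(entry) || !pte_present(entry))
+		return;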
+
+However, if we wish to manipulate an existing page table entry and care about
+the previously stored data, we must go further and use a hardware atomic
+operation as, for example, in :c:func:`!ptep_get_and_clear`.
+
+Equally, operations that do not rely on the page table locks, such as GUP-fast
+(for instance see :c:func:`!gup_fast` and its various page table level handlers
+like :c:func:`!gup_fast_pte_range`), must very carefully interact with page
+table entries, using functions such as :c:func:`!ptep_get_lockless` and
+equivalent for higher page table levels.
+
+Writes to page table entries must also be appropriately atomic, as established
+by :c:func:`!set_pXX` functions - :c:func:`!set_pgd`, :c:func:`!set_p4d`,
+:c:func:`!set_pud`, :c:func:`!set_pmd`, and :c:func:`!set_pte`.
+
+
+Page table installation
+^^^^^^^^^^^^^^^^^^^^^^^
+
+When allocating a P4D, PUD or PMD and setting the relevant entry in the above
+PGD, P4D or PUD, the :c:member:`!mm->page_table_lock` must be held. This is
+acquired in :c:func:`!__p4d_alloc`, :c:func:`!__pud_alloc` and
+:c:func:`!__pmd_alloc` respectively.
+
+.. note:: :c:func:`!__pmd_alloc` actually invokes :c:func:`!pud_lock` and
+   :c:func:`!pud_lockptr` in turn, however at the time of writing it ultimately
+   references the :c:member:`!mm->page_table_lock`.
+
+Allocating a PTE will either use the :c:member:`!mm->page_table_lock` or, if
+:c:macro:`!USE_SPLIT_PMD_PTLOCKS` is defined, use a lock embedded in the PMD
+physical page metadata in the form of a :c:struct:`!struct ptdesc`, acquired by
+:c:func:`!pmd_ptdesc` called from :c:func:`!pmd_lock` and ultimately
+:c:func:`!__pte_alloc`.
+
+Finally, the PTE lock receives special treatment, as it must be acquired
+whenever we want stable and exclusive access to entries pointing to data pages
+within a PTE table, especially when we wish to modify them.
+
+This is performed via :c:func:`!pte_offset_map_lock` which carefully checks to
+ensure that the PTE hasn't changed from under us, ultimately invoking
+:c:func:`!pte_lockptr` to obtain a spin lock at PTE granularity contained within
+the :c:struct:`!struct ptdesc` associated with the physical PTE page. The lock
+must be released via :c:func:`!pte_unmap_unlock`.
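+
+The typical pattern, eliding per-entry logic, is sketched below:
+
+.. code-block:: c
+
+	pte_t *pte;
+	spinlock_t *ptl;
+
+	/* Map and lock the PTE table, checking it hasn't been removed. */
+	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	if (!pte)
+		return;	/* The PTE table was stolen from under us - retry. */
+
+	/* ... operate on entries via ptep_get(pte) and friends ... */
+
+	pte_unmap_unlock(pte, ptl);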
+
+.. note:: There are some variants on this, such as
+   :c:func:`!pte_offset_map_rw_nolock` when we know we hold the PTE stable but
+   for brevity we do not explore this.  See the comment for
+   :c:func:`!__pte_offset_map_lock` for more details.
+
+When modifying data in ranges we typically only wish to allocate higher page
+tables as necessary, using these locks to avoid races or overwriting anything,
+and set/clear data at the PTE level as required (for instance when page faulting
+or zapping).
+
+Page table freeing
+^^^^^^^^^^^^^^^^^^
+
+Tearing down page tables themselves is something that requires significant
+care. There must be no way that page tables designated for removal can be
+traversed or referenced by concurrent tasks.
+
+It is insufficient to simply hold an mmap write lock and VMA lock (which will
+prevent racing faults and rmap operations), as a file-backed mapping can be
+truncated under the :c:struct:`!struct address_space` :c:member:`!i_mmap_rwsem`
+alone.
+
+As a result, no VMA which can be accessed via the reverse mapping (either
+anon_vma or the :c:member:`!struct address_space->i_mmap` interval tree) can
+have its page tables torn down.
+
+The operation is typically performed via :c:func:`!free_pgtables`, which assumes
+either the mmap write lock has been taken (as specified by its
+:c:member:`!mm_wr_locked` parameter), or that the VMA is already unreachable.
+
+It carefully removes the VMA from all reverse mappings; however it is important
+that no new reverse mappings overlap the range, and that no other route remains
+by which addresses within the range whose page tables are being torn down might
+be accessed.
+
+As a result of these careful conditions, note that page table entries are
+cleared without page table locks, as it is assumed that all of these precautions
+have already been taken (in the :c:func:`!pgd_clear`, :c:func:`!p4d_clear`,
+:c:func:`!pud_clear`, and :c:func:`!pmd_clear` functions - note that at this
+stage it is assumed that PTE entries have been zapped).
+
+.. note:: It is possible for leaf page tables to be torn down independent of
+          the page tables above them, as is done by
+          :c:func:`!retract_page_tables`, which is performed under the i_mmap
+          read lock and the PMD and PTE page table locks, without this level
+          of care.
+
+Page table moving
+^^^^^^^^^^^^^^^^^
+
+Some functions manipulate page table levels above PMD (that is PUD, P4D and PGD
+page tables). Most notable of these is :c:func:`!mremap`, which is capable of
+moving higher level page tables.
+
+In these instances, it is either required that **all** locks are taken, that is
+the mmap lock, the VMA lock and the relevant rmap lock, or that the mmap lock
+and VMA locks are taken and some other measure is taken to avoid rmap races (see
+the comment in :c:func:`!move_ptes` in the :c:func:`!mremap` implementation for
+details of how this is handled in this instance).
+
+You can observe this in the :c:func:`!mremap` implementation in the functions
+:c:func:`!take_rmap_locks` and :c:func:`!drop_rmap_locks`, which perform the
+rmap side of lock acquisition, invoked ultimately by
+:c:func:`!move_page_tables`.
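+
+In simplified sketch form (see :c:func:`!move_ptes` for the conditions under
+which the rmap locks are actually required), the pattern is:
+
+.. code-block:: c
+
+	if (need_rmap_locks)
+		take_rmap_locks(vma);
+	/* ... move page table entries to their new location ... */
+	if (need_rmap_locks)
+		drop_rmap_locks(vma);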
+
+VMA lock internals
+------------------
+
+VMA read locking is entirely optimistic - if the lock is contended or a
+competing write has started, then we do not obtain a read lock.
+
+The :c:func:`!lock_vma_under_rcu` function first calls :c:func:`!rcu_read_lock`
+to ensure that the VMA is acquired in an RCU critical section, then attempts to
+acquire a VMA read lock via :c:func:`!vma_start_read`, before releasing the RCU
+lock via :c:func:`!rcu_read_unlock`.
+
+VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
+their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
+via :c:func:`!vma_end_read`.
+
+VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances
+where a VMA is about to be modified; unlike :c:func:`!vma_start_read`, the lock
+is always acquired. An mmap write lock **must** be held for the duration of the
+VMA write lock; releasing or downgrading the mmap write lock also releases the
+VMA write lock, so there is no :c:func:`!vma_end_write` function.
+
+Note that the semaphore write lock is not held for the duration of the VMA
+write lock. Rather, a sequence number is used for serialisation, and the write
+semaphore is only acquired at the point of write lock in order to update it.
+
+This ensures the semantics we require - VMA write locks provide exclusive write
+access to the VMA.
+
+The VMA lock mechanism is designed to be a lightweight means of avoiding the use
+of the heavily contended mmap lock. It is implemented using a combination of a
+read/write semaphore and sequence numbers belonging to the containing
+:c:struct:`!struct mm_struct` and the VMA.
+
+Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
+operation, i.e. it tries to acquire a read lock but returns false if it is
+unable to do so. At the end of the read operation, :c:func:`!vma_end_read` is
+called to release the VMA read lock. This can be done under RCU alone.
+
+Writing requires the mmap lock to be write-locked and the VMA lock to be
+acquired via :c:func:`!vma_start_write`; however the write lock is released by
+the termination or downgrade of the mmap write lock, so no
+:c:func:`!vma_end_write` is required.
+
+All this is achieved by the use of per-mm and per-VMA sequence counts, which are
+used in order to reduce complexity, especially for operations which write-lock
+multiple VMAs at once.
+
+If the mm sequence count, :c:member:`!mm->mm_lock_seq` is equal to the VMA
+sequence count :c:member:`!vma->vm_lock_seq` then the VMA is write-locked. If
+they differ, then they are not.
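+
+Expressed as a sketch (the actual implementation differs in detail - see
+:c:func:`!vma_assert_write_locked` and related functions):
+
+.. code-block:: c
+
+	/* Sketch only - a VMA is write-locked when the counts match. */
+	static bool is_vma_write_locked(struct vm_area_struct *vma)
+	{
+		return vma->vm_lock_seq == vma->vm_mm->mm_lock_seq;
+	}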
+
+Each time an mmap write lock is acquired in :c:func:`!mmap_write_lock`,
+:c:func:`!mmap_write_lock_nested` or :c:func:`!mmap_write_lock_killable`, the
+:c:member:`!mm->mm_lock_seq` sequence number is incremented via
+:c:func:`!mm_lock_seqcount_begin`.
+
+Each time the mmap write lock is released in :c:func:`!mmap_write_unlock` or
+:c:func:`!mmap_write_downgrade`, :c:func:`!vma_end_write_all` is invoked which
+also increments :c:member:`!mm->mm_lock_seq` via
+:c:func:`!mm_lock_seqcount_end`.
+
+This way, we ensure regardless of the VMA's sequence number count, that a write
+lock is not incorrectly indicated (since we increment the sequence counter on
+acquiring the mmap write lock, which is required in order to obtain a VMA write
+lock), and that when we release an mmap write lock, we efficiently release
+**all** VMA write locks contained within the mmap at the same time.
+
+The exclusivity of the mmap write lock ensures this is what we want, as there
+would never be a reason to persist per-VMA write locks across multiple mmap
+write lock acquisitions.
+
+Each time a VMA read lock is acquired, we acquire a read lock on the
+:c:member:`!vma->vm_lock` read/write semaphore and hold it, while checking that
+the sequence count of the VMA does not match that of the mm.
+
+If it does, the read lock fails. If it does not, we hold the lock, excluding
+writers, but permitting other readers, who will also obtain this lock under RCU.
+
+Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
+are also RCU safe, so the whole read lock operation is guaranteed to function
+correctly.
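+
+The read-side logic can be sketched as follows (simplified from
+:c:func:`!vma_start_read`, with details such as overflow handling omitted):
+
+.. code-block:: c
+
+	/* Already write-locked? Don't wait, fall back to the mmap lock. */
+	if (vma->vm_lock_seq == mm->mm_lock_seq)
+		return false;
+
+	/* Contended? Again, give up rather than block. */
+	if (!down_read_trylock(&vma->vm_lock->lock))
+		return false;
+
+	/* Recheck under the lock - a writer may have raced with us. */
+	if (vma->vm_lock_seq == mm->mm_lock_seq) {
+		up_read(&vma->vm_lock->lock);
+		return false;
+	}
+	return true;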
+
+On the write side, we acquire a write lock on the :c:member:`!vma->vm_lock`
+read/write semaphore, before setting the VMA's sequence number under this lock,
+also simultaneously holding the mmap write lock.
+
+This way, if any read locks are in effect, :c:func:`!vma_start_write` will sleep
+until these are finished and mutual exclusion is achieved.
+
+After setting the VMA's sequence number, the lock is released, avoiding
+complexity with a long-term held write lock.
+
+This clever combination of a read/write semaphore and sequence count allows for
+fast RCU-based per-VMA lock acquisition (especially on page fault, though
+utilised elsewhere) with minimal complexity around lock ordering.
+
+mmap write lock downgrading
+---------------------------
+
+When an mmap write lock is held, one has exclusive access to resources within
+the mmap (with the usual caveats about requiring VMA write locks to avoid races
+with tasks holding VMA read locks).
+
+It is then possible to **downgrade** from a write lock to a read lock via
+:c:func:`!mmap_write_downgrade` which, similar to :c:func:`!mmap_write_unlock`,
+implicitly terminates all VMA write locks via :c:func:`!vma_end_write_all`, but
+importantly does not relinquish the mmap lock while downgrading, therefore
+keeping the locked virtual address space stable.
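+
+A typical usage pattern, sketched below, performs mutating work under the write
+lock, then downgrades for work which need only exclude writers:
+
+.. code-block:: c
+
+	mmap_write_lock(mm);
+	/* ... modify VMAs, taking VMA write locks as needed ... */
+	mmap_write_downgrade(mm);	/* Now a read lock; VMA write locks released. */
+	/* ... work which must not race with writers or other downgraders ... */
+	mmap_read_unlock(mm);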
+
+An interesting consequence of this is that downgraded locks will be exclusive
+against any other task possessing a downgraded lock (since they'd have to
+acquire a write lock first to do so, and the lock now being a read lock prevents
+this).
+
+For clarity, mapping read (R)/downgraded write (D)/write (W) locks against one
+another showing which locks exclude the others:
+
+.. list-table:: Lock exclusivity
+   :widths: 5 5 5 5
+   :header-rows: 1
+   :stub-columns: 1
+
+   * -
+     - R
+     - D
+     - W
+   * - R
+     - N
+     - N
+     - Y
+   * - D
+     - N
+     - Y
+     - Y
+   * - W
+     - Y
+     - Y
+     - Y
+
+Here a Y indicates the locks in the matching row/column exclude one another, and
+N indicates that they do not.
+
+Stack expansion
+---------------
+
+Stack expansion throws up additional complexities in that we cannot permit
+there to be racing page faults. As a result, we invoke
+:c:func:`!vma_start_write` to prevent this in :c:func:`!expand_downwards` or
+:c:func:`!expand_upwards`.