diff mbox series

[RFC] docs/mm: add VMA locks documentation

Message ID 20241101185033.131880-1-lorenzo.stoakes@oracle.com (mailing list archive)
State New
Headers show
Series [RFC] docs/mm: add VMA locks documentation | expand

Commit Message

Lorenzo Stoakes Nov. 1, 2024, 6:50 p.m. UTC
Locking around VMAs is complicated and confusing. While we have a number of
disparate comments scattered around the place, we seem to be reaching a
level of complexity that justifies a serious effort at clearly documenting
how locks are expected to be used when interacting with mm_struct and
vm_area_struct objects.

This is especially pertinent as regards efforts to find sensible
abstractions for these fundamental objects within the kernel Rust
abstraction layer, whose compiler strictly requires some means of expressing
these rules (and through this expression can help self-document these
requirements as well as enforce them, which is an exciting concept).

The document limits scope to mmap and VMA locks and those that are
immediately adjacent and relevant to them - so additionally covers page
table locking as this is so very closely tied to VMA operations (and relies
upon us handling these correctly).

The document tries to cover some of the nastier and more confusing edge
cases and concerns especially around lock ordering and page table teardown.

The document also provides some VMA lock internals, which are up to date
and inclusive of the recent sequence number changes.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---

REVIEWERS NOTES:
   You can speed up doc builds by running `make SPHINXDIRS=mm htmldocs`. I
   also uploaded a copy of this to my website at
   https://ljs.io/output/mm/vma_locks to make it easier to have a quick
   read through. Thanks!


 Documentation/mm/index.rst     |   1 +
 Documentation/mm/vma_locks.rst | 527 +++++++++++++++++++++++++++++++++
 2 files changed, 528 insertions(+)
 create mode 100644 Documentation/mm/vma_locks.rst

--
2.47.0

Comments

Lorenzo Stoakes Nov. 1, 2024, 8:58 p.m. UTC | #1
+cc Suren, linux-doc sorry, forgetting cc's all over this evening... (Friday
etc. :)

Suren - could you take a look at the VMA lock stuff + check it's
sane/correct any mistakes? I generated output from this change and uploaded
to my website for review convenience [0].

Thanks!

[0] https://ljs.io/output/mm/vma_locks

On Fri, Nov 01, 2024 at 06:50:33PM +0000, Lorenzo Stoakes wrote:
[...]
> diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
> index 0be1c7503a01..da5f30acaca5 100644
> --- a/Documentation/mm/index.rst
> +++ b/Documentation/mm/index.rst
> @@ -64,3 +64,4 @@ documentation, or deleted if it has served its purpose.
>     vmemmap_dedup
>     z3fold
>     zsmalloc
> +   vma_locks
> diff --git a/Documentation/mm/vma_locks.rst b/Documentation/mm/vma_locks.rst
> new file mode 100644
> index 000000000000..52b9d484376a
> --- /dev/null
> +++ b/Documentation/mm/vma_locks.rst
> @@ -0,0 +1,527 @@
> +VMA Locking
> +===========
> +
> +Overview
> +--------
> +
> +Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
> +'VMA's of type `struct vm_area_struct`.
> +
> +Each VMA describes a virtually contiguous memory range with identical
> +attributes, each of which is described by a `struct vm_area_struct`
> +object. Userland access outside of VMAs is invalid except in the case where an
> +adjacent stack VMA could be extended to contain the accessed address.
> +
> +All VMAs are contained within one and only one virtual address space, described
> +by a `struct mm_struct` object which is referenced by all tasks (that is,
> +threads) which share the virtual address space. We refer to this as the `mm`.
> +
> +Each mm object contains a maple tree data structure which describes all VMAs
> +within the virtual address space.
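> +
> +For illustration only, below is a minimal sketch of walking these VMAs via the
> +VMA iterator (the hypothetical `example_print_vmas()` helper is not a real
> +kernel function, and the locking used here is explored in detail below):
> +
> +.. code-block:: c
> +
> +  /* Assumes the caller holds a reference to @mm. */
> +  static void example_print_vmas(struct mm_struct *mm)
> +  {
> +          struct vm_area_struct *vma;
> +          VMA_ITERATOR(vmi, mm, 0);
> +
> +          /* The mmap read lock keeps the set of VMAs stable while we walk. */
> +          mmap_read_lock(mm);
> +          for_each_vma(vmi, vma)
> +                  pr_info("VMA %#lx-%#lx\n", vma->vm_start, vma->vm_end);
> +          mmap_read_unlock(mm);
> +  }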
> +
> +The kernel is designed to be highly scalable against concurrent access to
> +userland memory, so a complicated set of locks is required to ensure no data
> +races or memory corruption occurs.
> +
> +This document explores this locking in detail.
> +
> +.. note::
> +
> +   There are three different things that a user might want to achieve via
> +   locks - the first of which is **stability**. That is - ensuring that the VMA
> +   won't be freed or modified in any way from underneath us.
> +
> +   All MM and VMA locks ensure stability.
> +
> +   Secondly, we have locks which allow **reads** but not writes (and which
> +   might be held concurrently with other CPUs that also hold the read lock).
> +
> +   Finally, we have locks which permit exclusive access to the VMA to allow for
> +   **writes** to the VMA.
> +
> +MM and VMA locks
> +----------------
> +
> +There are two key classes of lock utilised when reading and manipulating VMAs -
> +the `mmap_lock` which is a read/write semaphore maintained at the `mm_struct`
> +level of granularity and, if CONFIG_PER_VMA_LOCK is set, a per-VMA lock at the
> +VMA level of granularity.
> +
> +.. note::
> +
> +   Generally speaking, a read/write semaphore is a class of lock which permits
> +   concurrent readers. However a write lock can only be obtained once all
> +   readers have left the critical region (and pending readers made to wait).
> +
> +   This renders read locks on a read/write semaphore concurrent with other
> +   readers and write locks exclusive against all others holding the semaphore.
> +
> +If CONFIG_PER_VMA_LOCK is not set, then things are relatively simple - a write
> +mmap lock gives you exclusive write access to a VMA, and a read lock gives you
> +concurrent read-only access.
> +
> +In the presence of CONFIG_PER_VMA_LOCK, i.e. VMA locks, things are more
> +complicated. In this instance, a write semaphore is no longer enough to gain
> +exclusive access to a VMA; a VMA write lock is also required.
> +
> +The VMA lock is implemented via the use of both a read/write semaphore and
> +per-VMA and per-mm sequence numbers. We go into detail on this in the VMA lock
> +internals section below, so for the time being it is important only to note that
> +we can obtain either a VMA read or write lock.
> +
> +.. note::
> +
> +   VMAs under VMA **read** lock are obtained by the `lock_vma_under_rcu()`
> +   function, and **no** existing mmap or VMA lock must be held. This function
> +   either returns a read-locked VMA, or NULL if the lock could not be
> +   acquired. As the name suggests, the VMA is looked up under RCU and, once
> +   obtained, remains stable.
> +
> +   This kind of locking is entirely optimistic - if the lock is contended or a
> +   competing write has started, then we do not obtain a read lock.
> +
> +   The `lock_vma_under_rcu()` function first calls `rcu_read_lock()` to ensure
> +   that the VMA is acquired in an RCU critical section, then attempts to
> +   acquire the VMA read lock via `vma_start_read()`, before releasing the RCU
> +   lock via `rcu_read_unlock()`.
> +
> +   VMA read locks hold a read lock on the `vma->vm_lock` semaphore for their
> +   duration and the caller of `lock_vma_under_rcu()` must release it via
> +   `vma_end_read()`.
> +
> +   VMA **write** locks are acquired via `vma_start_write()` in instances where a
> +   VMA is about to be modified; unlike `vma_start_read()`, the lock is always
> +   acquired. An mmap write lock **must** be held for the duration of the VMA
> +   write lock - releasing or downgrading the mmap write lock also releases the
> +   VMA write lock, so there is no `vma_end_write()` function.
> +
> +   Note that a semaphore write lock is not held across a VMA lock. Rather, a
> +   sequence number is used for serialisation, and the write semaphore is only
> +   acquired at the point of write lock to update this (we explore this in detail
> +   in the VMA lock internals section below).
> +
> +   This ensures the semantics we require - VMA write locks provide exclusive
> +   write access to the VMA.
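> +
> +By way of illustration, a hypothetical helper (not a real kernel function)
> +which needs read-only access to the VMA containing an address might
> +optimistically try the VMA read lock and fall back to the mmap read lock:
> +
> +.. code-block:: c
> +
> +  static void example_read_vma(struct mm_struct *mm, unsigned long addr)
> +  {
> +          struct vm_area_struct *vma;
> +
> +          /* Optimistic - VMA read lock under RCU, or NULL on failure. */
> +          vma = lock_vma_under_rcu(mm, addr);
> +          if (vma) {
> +                  /* The VMA is stable - read its fields here. */
> +                  vma_end_read(vma);
> +                  return;
> +          }
> +
> +          /* Contended or being written to - fall back to the mmap read lock. */
> +          mmap_read_lock(mm);
> +          vma = vma_lookup(mm, addr);
> +          if (vma) {
> +                  /* The VMA is stable until the mmap read lock is dropped. */
> +          }
> +          mmap_read_unlock(mm);
> +  }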
> +
> +Examining all valid lock states and what each implies:
> +
> +.. list-table::
> +   :header-rows: 1
> +
> +   * - mmap lock
> +     - VMA lock
> +     - Stable?
> +     - Can read safely?
> +     - Can write safely?
> +   * - \-
> +     - \-
> +     - N
> +     - N
> +     - N
> +   * - R
> +     - \-
> +     - Y
> +     - Y
> +     - N
> +   * - \-
> +     - R
> +     - Y
> +     - Y
> +     - N
> +   * - W
> +     - \-
> +     - Y
> +     - Y
> +     - N
> +   * - W
> +     - W
> +     - Y
> +     - Y
> +     - Y
> +
> +Note that there are some exceptions to this - the `anon_vma` field is permitted
> +to be written to under mmap read lock and is instead serialised by the `struct
> +mm_struct` field `page_table_lock`. In addition, the `vm_mm` and all
> +lock-specific fields are permitted to be read under RCU alone (though stability
> +cannot be expected in this instance).
> +
> +.. note::
> +   The most notable place to use the VMA read lock is on page faults on
> +   the x86-64 architecture, which importantly means that without a VMA write
> +   lock, page faults can race against you even if you hold an mmap write lock.
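> +
> +To fully exclude racing page faults when modifying a VMA, both locks must
> +therefore be taken, as in the following sketch (the `example_modify_vma()`
> +helper is hypothetical and error handling is omitted):
> +
> +.. code-block:: c
> +
> +  static void example_modify_vma(struct mm_struct *mm, unsigned long addr)
> +  {
> +          struct vm_area_struct *vma;
> +
> +          mmap_write_lock(mm);
> +
> +          vma = vma_lookup(mm, addr);
> +          if (vma) {
> +                  /* Also takes the VMA write lock, excluding racing faults. */
> +                  vma_start_write(vma);
> +                  /* ... it is now safe to modify the VMA ... */
> +          }
> +
> +          /* Releases every VMA write lock taken under this mmap write lock. */
> +          mmap_write_unlock(mm);
> +  }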
> +
> +VMA Fields
> +----------
> +
> +We examine each field of the `struct vm_area_struct` type in detail in the table
> +below.
> +
> +Reading of each field requires either an mmap read lock or a VMA read lock to be
> +held, except where 'unstable RCU read' is specified, in which case unstable
> +access to the field is permitted under RCU alone.
> +
> +The table specifies which write locks must be held to write to the field.
> +
> +.. list-table::
> +   :widths: 20 10 22 5 20
> +   :header-rows: 1
> +
> +   * - Field
> +     - Config
> +     - Description
> +     - Unstable RCU read?
> +     - Write Lock
> +   * - vm_start
> +     -
> +     - Inclusive start virtual address of range VMA describes.
> +     -
> +     - mmap write, VMA write
> +   * - vm_end
> +     -
> +     - Exclusive end virtual address of range VMA describes.
> +     -
> +     - mmap write, VMA write
> +   * - vm_rcu
> +     - vma lock
> +     - RCU head, in union with vm_start, vm_end. RCU implementation detail.
> +     - N/A
> +     - N/A
> +   * - vm_mm
> +     -
> +     - Containing mm_struct.
> +     - Y
> +     - (Static)
> +   * - vm_page_prot
> +     -
> +     - Architecture-specific page table protection bits determined from VMA
> +       flags
> +     -
> +     - mmap write, VMA write
> +   * - vm_flags
> +     -
> +     - Read-only access to VMA flags describing attributes of VMA, in union with
> +       private writable `__vm_flags`.
> +     -
> +     - N/A
> +   * - __vm_flags
> +     -
> +     - Private, writable access to VMA flags field, updated by vm_flags_*()
> +       functions.
> +     -
> +     - mmap write, VMA write
> +   * - detached
> +     - vma lock
> +     - VMA lock implementation detail - indicates whether the VMA has been
> +       detached from the tree.
> +     - Y
> +     - mmap write, VMA write
> +   * - vm_lock_seq
> +     - vma lock
> +     - VMA lock implementation detail - A sequence number used to serialise the
> +       VMA lock, see the VMA lock section below.
> +     - Y
> +     - mmap write, VMA write
> +   * - vm_lock
> +     - vma lock
> +     - VMA lock implementation detail - A pointer to the VMA lock read/write
> +       semaphore.
> +     - Y
> +     - None required
> +   * - shared.rb
> +     -
> +     - A red/black tree node used, if the mapping is file-backed, to place the
> +       VMA in the `struct address_space->i_mmap` red/black interval tree.
> +     -
> +     - mmap write, VMA write, i_mmap write
> +   * - shared.rb_subtree_last
> +     -
> +     - Metadata used for management of the interval tree if the VMA is
> +       file-backed.
> +     -
> +     - mmap write, VMA write, i_mmap write
> +   * - anon_vma_chain
> +     -
> +     - List of links to forked/CoW'd `anon_vma` objects.
> +     -
> +     - mmap read or above, anon_vma write lock
> +   * - anon_vma
> +     -
> +     - `anon_vma` object used by anonymous folios mapped exclusively to this VMA.
> +     -
> +     - mmap read or above, page_table_lock
> +   * - vm_ops
> +     -
> +     - If the VMA is file-backed, then either the driver or file-system provides
> +       a `struct vm_operations_struct` object describing callbacks to be invoked
> +       on specific VMA lifetime events.
> +     -
> +     - (Static)
> +   * - vm_pgoff
> +     -
> +     - Describes the page offset into the file, the original page offset within
> +       the virtual address space (prior to any `mremap()`), or PFN if a PFN map.
> +     -
> +     - mmap write, VMA write
> +   * - vm_file
> +     -
> +     - If the VMA is file-backed, points to a `struct file` object describing
> +       the underlying file, if anonymous then `NULL`.
> +     -
> +     - (Static)
> +   * - vm_private_data
> +     -
> +     - A `void *` field for driver-specific metadata.
> +     -
> +     - Driver-mandated.
> +   * - anon_name
> +     - anon name
> +     - A field for storing a `struct anon_vma_name` object providing a name for
> +       anonymous mappings, or `NULL` if none is set or the VMA is file-backed.
> +     -
> +     - mmap write, VMA write
> +   * - swap_readahead_info
> +     - swap
> +     - Metadata used by the swap mechanism to perform readahead.
> +     -
> +     - mmap read
> +   * - vm_region
> +     - nommu
> +     - The containing region for the VMA for architectures which do not
> +       possess an MMU.
> +     - N/A
> +     - N/A
> +   * - vm_policy
> +     - numa
> +     - `mempolicy` object which describes NUMA behaviour of the VMA.
> +     -
> +     - mmap write, VMA write
> +   * - numab_state
> +     - numab
> +     - `vma_numab_state` object which describes the current state of NUMA
> +       balancing in relation to this VMA.
> +     -
> +     - mmap write, VMA write
> +   * - vm_userfaultfd_ctx
> +     -
> +     - Userfaultfd context wrapper object of type `vm_userfaultfd_ctx`, either
> +       of zero size if userfaultfd is disabled, or containing a pointer to an
> +       underlying `userfaultfd_ctx` object which describes userfaultfd metadata.
> +     -
> +     - mmap write, VMA write
> +
> +.. note::
> +
> +   In the config column, 'vma lock' means CONFIG_PER_VMA_LOCK, 'anon name'
> +   means CONFIG_ANON_VMA_NAME, 'swap' means CONFIG_SWAP, 'nommu' means that
> +   CONFIG_MMU is not set, 'numa' means CONFIG_NUMA and 'numab' means
> +   CONFIG_NUMA_BALANCING.
> +
> +   In the write lock column '(Static)' means that the field is set only once
> +   upon initialisation of the VMA and not changed after this; the VMA would
> +   either have been under an mmap write and VMA write lock at the time, or not
> +   yet inserted into any tree.
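> +
> +As an example of how these rules look in practice, VMA flags are not written
> +directly - helpers such as `vm_flags_set()` and `vm_flags_clear()` update the
> +private `__vm_flags` field and, at the time of writing, take the VMA write lock
> +on the caller's behalf via `vma_start_write()`. A hypothetical sketch:
> +
> +.. code-block:: c
> +
> +  static void example_set_vm_locked(struct mm_struct *mm, unsigned long addr)
> +  {
> +          struct vm_area_struct *vma;
> +
> +          mmap_write_lock(mm);
> +
> +          vma = vma_lookup(mm, addr);
> +          if (vma)
> +                  /* VMA write locks the VMA before updating __vm_flags. */
> +                  vm_flags_set(vma, VM_LOCKED);
> +
> +          mmap_write_unlock(mm);
> +  }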
> +
> +Page table locks
> +----------------
> +
> +When allocating a P4D, PUD or PMD and setting the relevant entry in the parent
> +PGD, P4D or PUD, the `mm->page_table_lock` is acquired. This is done in
> +`__p4d_alloc()`, `__pud_alloc()` and `__pmd_alloc()` respectively.
> +
> +.. note::
> +   `__pmd_alloc()` actually invokes `pud_lock()` and `pud_lockptr()` in turn,
> +   however at the time of writing it ultimately references the
> +   `mm->page_table_lock`.
> +
> +Allocating a PTE will either use the `mm->page_table_lock` or, if
> +`USE_SPLIT_PMD_PTLOCKS` is defined, use a lock embedded in the PMD physical
> +page metadata in the form of a `struct ptdesc`, acquired by `pmd_ptdesc()`
> +called from `pmd_lock()` and ultimately `__pte_alloc()`.
> +
> +Finally, modifying the contents of a PTE has special treatment, as there is a
> +lock that we must acquire whenever we want stable and exclusive access to
> +entries pointing to data pages within a PTE, especially when we wish to modify
> +them.
> +
> +This is performed via `pte_offset_map_lock()` which carefully checks to ensure
> +that the PTE hasn't changed from under us, ultimately invoking `pte_lockptr()`
> +to obtain a spin lock at PTE granularity contained within the `struct ptdesc`
> +associated with the physical PTE page. The lock must be released via
> +`pte_unmap_unlock()`.
> +
> +.. note::
> +   There are some variants on this, such as `pte_offset_map_rw_nolock()` when we
> +   know we hold the PTE stable but for brevity we do not explore this.
> +   See the comment for `__pte_offset_map_lock()` for more details.
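> +
> +A hypothetical sketch of the usual pattern (the `example_walk_ptes()` helper is
> +illustrative only, and assumes the caller holds the appropriate VMA and higher
> +level page table locks):
> +
> +.. code-block:: c
> +
> +  static int example_walk_ptes(struct mm_struct *mm, pmd_t *pmd,
> +                               unsigned long addr, unsigned long end)
> +  {
> +          pte_t *start_pte, *pte;
> +          spinlock_t *ptl;
> +
> +          start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> +          if (!pte)
> +                  return -EAGAIN; /* PTE table changed - caller should retry. */
> +
> +          for (; addr < end; pte++, addr += PAGE_SIZE) {
> +                  pte_t entry = ptep_get(pte);
> +
> +                  if (pte_none(entry))
> +                          continue;
> +                  /* The entry is stable - examine or modify it here. */
> +          }
> +
> +          pte_unmap_unlock(start_pte, ptl);
> +          return 0;
> +  }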
> +
> +When modifying data in ranges we typically only wish to allocate higher page
> +tables as necessary, using these locks to avoid races or overwriting anything,
> +and set/clear data at the PTE level as required (for instance when page faulting
> +or zapping).
> +
> +Page table teardown
> +-------------------
> +
> +Tearing down page tables themselves is something that requires significant
> +care. There must be no way that page tables designated for removal can be
> +traversed or referenced by concurrent tasks.
> +
> +It is insufficient to simply hold an mmap write lock and VMA lock (which will
> +prevent racing faults and rmap operations), as a file-backed mapping can be
> +truncated under the `struct address_space` i_mmap_lock alone.
> +
> +As a result, no VMA which can be accessed via the reverse mapping (either
> +anon_vma or the `struct address_space->i_mmap` interval tree) can have its page
> +tables torn down.
> +
> +The operation is typically performed via `free_pgtables()`, which assumes either
> +the mmap write lock has been taken (as specified by its `mm_wr_locked`
> +parameter), or that the VMA is fully detached.
> +
> +It carefully removes the VMA from all reverse mappings; however it is important
> +that no new reverse mappings overlap the range, and that no route remains which
> +permits access to addresses within the range whose page tables are being torn
> +down.
> +
> +As a result of these careful conditions, note that page table entries are
> +cleared without page table locks, as it is assumed that all of these precautions
> +have already been taken.
> +
> +mmap write lock downgrading
> +---------------------------
> +
> +While it is possible to obtain an mmap write or read lock using the
> +`mm->mmap_lock` read/write semaphore, it is also possible to **downgrade** from
> +a write lock to a read lock via `mmap_write_downgrade()`.
> +
> +Similar to `mmap_write_unlock()`, this implicitly terminates all VMA write locks
> +via `vma_end_write_all()` (more on this behaviour in the VMA lock internals
> +section below), but importantly does not relinquish the mmap lock while
> +downgrading, therefore keeping the locked virtual address space stable.
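> +
> +A hypothetical sketch of the pattern (the work performed under each lock is
> +illustrative only):
> +
> +.. code-block:: c
> +
> +  static void example_downgrade(struct mm_struct *mm)
> +  {
> +          mmap_write_lock(mm);
> +          /* ... perform modifications requiring the write lock ... */
> +
> +          /* Also releases all VMA write locks, but keeps the mmap lock held. */
> +          mmap_write_downgrade(mm);
> +
> +          /* ... read-only work - writers excluded, other readers permitted ... */
> +          mmap_read_unlock(mm);
> +  }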
> +
> +A subtlety here is that callers can assume, if they invoke an
> +mmap_write_downgrade() operation, that they still have exclusive access to the
> +virtual address space (excluding VMA read lock holders), as for another task to
> +have obtained write access it would have needed exclusive access to the
> +semaphore, which cannot be the case until the current task completes what it is
> +doing.
> +
> +Stack expansion
> +---------------
> +
> +Stack expansion throws up additional complexities in that we cannot permit
> +there to be racing page faults; as a result we invoke `vma_start_write()` in
> +`expand_downwards()` and `expand_upwards()` to prevent this.
> +
> +Lock ordering
> +-------------
> +
> +As we have multiple locks across the kernel which may or may not be taken at the
> +same time as explicit mm or VMA locks, we have to be wary of lock inversion, and
> +the **order** in which locks are acquired and released becomes very important.
> +
> +.. note::
> +
> +   Lock inversion occurs when two threads need to acquire multiple locks,
> +   but in doing so inadvertently cause a mutual deadlock.
> +
> +   For example, consider thread 1 which holds lock A and tries to acquire lock B,
> +   while thread 2 holds lock B and tries to acquire lock A.
> +
> +   Both threads are now deadlocked on each other. However, had they attempted to
> +   acquire locks in the same order, one would have waited for the other to
> +   complete its work and no deadlock would have occurred.
> +
> +The opening comment in `mm/rmap.c` describes in detail the required ordering of
> +locks within memory management code:
> +
> +.. code-block::
> +
> +  inode->i_rwsem	(while writing or truncating, not reading or faulting)
> +    mm->mmap_lock
> +      mapping->invalidate_lock (in filemap_fault)
> +        folio_lock
> +          hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
> +            vma_start_write
> +              mapping->i_mmap_rwsem
> +                anon_vma->rwsem
> +                  mm->page_table_lock or pte_lock
> +                    swap_lock (in swap_duplicate, swap_info_get)
> +                      mmlist_lock (in mmput, drain_mmlist and others)
> +                      mapping->private_lock (in block_dirty_folio)
> +                          i_pages lock (widely used)
> +                            lruvec->lru_lock (in folio_lruvec_lock_irq)
> +                      inode->i_lock (in set_page_dirty's __mark_inode_dirty)
> +                      bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
> +                        sb_lock (within inode_lock in fs/fs-writeback.c)
> +                        i_pages lock (widely used, in set_page_dirty,
> +                                  in arch-dependent flush_dcache_mmap_lock,
> +                                  within bdi.wb->list_lock in __sync_single_inode)
> +
> +Please check the current state of this comment, which may have changed since
> +this document was written.
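> +
> +For example, a hypothetical operation (the `example_ordered_locking()` helper
> +is not a real kernel function) which needed the mmap lock, the VMA write lock
> +and both rmap locks would, per the hierarchy above, acquire and release them as
> +follows - noting that few real callers take all of these at once:
> +
> +.. code-block:: c
> +
> +  static void example_ordered_locking(struct mm_struct *mm,
> +                                      struct vm_area_struct *vma)
> +  {
> +          mmap_write_lock(mm);
> +          vma_start_write(vma);
> +
> +          if (vma->vm_file)
> +                  i_mmap_lock_write(vma->vm_file->f_mapping);
> +          if (vma->anon_vma)
> +                  anon_vma_lock_write(vma->anon_vma);
> +
> +          /* ... manipulate the reverse mappings ... */
> +
> +          if (vma->anon_vma)
> +                  anon_vma_unlock_write(vma->anon_vma);
> +          if (vma->vm_file)
> +                  i_mmap_unlock_write(vma->vm_file->f_mapping);
> +
> +          mmap_write_unlock(mm);
> +  }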
> +
> +VMA lock internals
> +------------------
> +
> +The VMA lock mechanism is designed to be a lightweight means of avoiding the use
> +of the heavily contended mmap lock. It is implemented using a combination of a
> +read/write semaphore and sequence numbers belonging to the containing `struct
> +mm_struct` and the VMA.
> +
> +Read locks are acquired via `vma_start_read()`, which is an optimistic
> +operation, i.e. it tries to acquire a read lock but returns false if it is
> +unable to do so. At the end of the read operation, `vma_end_read()` is called to
> +release the VMA read lock. This can be done under RCU alone.
> +
> +Writing requires the mmap to be write-locked and the VMA lock to be acquired via
> +`vma_start_write()`; however the write lock is released by the termination or
> +downgrade of the mmap write lock so no `vma_end_write()` is required.
> +
> +All this is achieved by the use of per-mm and per-VMA sequence counts. This is
> +used to reduce complexity, especially around operations which write-lock
> +multiple VMAs at once.
> +
> +If the mm sequence count, `mm->mm_lock_seq`, is equal to the VMA sequence
> +count, `vma->vm_lock_seq`, then the VMA is write-locked. If they differ, then
> +it is not.
> +
> +Each time an mmap write lock is acquired in `mmap_write_lock()`,
> +`mmap_write_lock_nested()` or `mmap_write_lock_killable()`, the `mm->mm_lock_seq`
> +sequence number is incremented via `mm_lock_seqcount_begin()`.
> +
> +Each time the mmap write lock is released in `mmap_write_unlock()` or
> +`mmap_write_downgrade()`, `vma_end_write_all()` is invoked, which also increments
> +`mm->mm_lock_seq` via `mm_lock_seqcount_end()`.
> +
> +This way, we ensure, regardless of the VMA's sequence number, that a write
> +lock is not incorrectly indicated (since we increment the sequence counter on
> +acquiring the mmap write lock, which is required in order to obtain a VMA write
> +lock), and that when we release an mmap write lock, we efficiently release
> +**all** VMA write locks contained within the mmap at the same time.
> +
> +The exclusivity of the mmap write lock ensures this is what we want, as there
> +would never be a reason to persist per-VMA write locks across multiple mmap
> +write lock acquisitions.
> +
> +Each time a VMA read lock is acquired, we acquire a read lock on the
> +`vma->vm_lock` read/write semaphore and hold it, while checking that the
> +sequence count of the VMA does not match that of the mm.
> +
> +If it does, the read lock fails. If it does not, we hold the lock, excluding
> +writers, but permitting other readers, who will also obtain this lock under RCU.
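> +
> +The read-lock logic is roughly equivalent to the following sketch, which is
> +simplified - the real `vma_start_read()` differs in detail, for example in how
> +the mm sequence count is read and in its memory ordering:
> +
> +.. code-block:: c
> +
> +  /* @mm_seq is the current value of the mm's sequence count. */
> +  static bool sketch_vma_start_read(struct vm_area_struct *vma,
> +                                    unsigned int mm_seq)
> +  {
> +          /* A matching sequence number means the VMA is write-locked. */
> +          if (READ_ONCE(vma->vm_lock_seq) == mm_seq)
> +                  return false;
> +
> +          /* Optimistic - do not wait if a writer holds the semaphore. */
> +          if (!down_read_trylock(&vma->vm_lock->lock))
> +                  return false;
> +
> +          /* Re-check - a writer may have raced in before we took the lock. */
> +          if (READ_ONCE(vma->vm_lock_seq) == mm_seq) {
> +                  up_read(&vma->vm_lock->lock);
> +                  return false;
> +          }
> +
> +          return true;
> +  }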
> +
> +Importantly, maple tree operations performed in `lock_vma_under_rcu()` are also
> +RCU safe, so the whole read lock operation is guaranteed to function correctly.
> +
> +On the write side, we acquire a write lock on the `vma->vm_lock` read/write
> +semaphore, before setting the VMA's sequence number under this lock, also
> +simultaneously holding the mmap write lock.
> +
> +This way, if any read locks are in effect, `vma_start_write()` will sleep until
> +these are finished and mutual exclusion is achieved.
> +
> +After setting the VMA's sequence number, the lock is released, avoiding the
> +complexity of a long-term held write lock.
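> +
> +Again as a simplified sketch (the real `vma_start_write()` differs in detail):
> +
> +.. code-block:: c
> +
> +  /* @mm_seq is the current value of the mm's sequence count. */
> +  static void sketch_vma_start_write(struct vm_area_struct *vma,
> +                                     unsigned int mm_seq)
> +  {
> +          mmap_assert_write_locked(vma->vm_mm);
> +
> +          /* Already write-locked under this mmap write lock? */
> +          if (vma->vm_lock_seq == mm_seq)
> +                  return;
> +
> +          /* Wait for any existing readers to finish. */
> +          down_write(&vma->vm_lock->lock);
> +          /* Readers check this under vm_lock, so they will now back off. */
> +          WRITE_ONCE(vma->vm_lock_seq, mm_seq);
> +          up_write(&vma->vm_lock->lock);
> +  }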
> +
> +This clever combination of a read/write semaphore and sequence count allows for
> +fast RCU-based per-VMA lock acquisition (especially on the x86-64 page fault
> +path, though utilised elsewhere) with minimal complexity around lock ordering.
> --
> 2.47.0
Suren Baghdasaryan Nov. 1, 2024, 10:41 p.m. UTC | #2
On Fri, Nov 1, 2024 at 1:58 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> +cc Suren, linux-doc sorry, forgetting cc's all over this evening... (Friday
> etc. :)
>
> Suren - could you take a look at the VMA lock stuff + check it's
> sane/correct any mistakes? I generated output from this change and uploaded
> to my website for review convenience [0].

Thanks! I'll take a look over the weekend. Quite ironically, I'm
currently working on some changes to vm_lock (moving it into
vm_area_struct, making vm_area_struct SLAB_TYPESAFE_BY_RCU, etc).
So... yeah, your timing is impeccable as usual!

>
> Thanks!
>
> [0] https://ljs.io/output/mm/vma_locks
>
[...]
SeongJae Park Nov. 1, 2024, 11:48 p.m. UTC | #3
On Fri, 1 Nov 2024 20:58:39 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

[...]
> On Fri, Nov 01, 2024 at 06:50:33PM +0000, Lorenzo Stoakes wrote:
[...]
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Acked-by: SeongJae Park <sj@kernel.org>

> > ---
> >
> > REVIEWERS NOTES:
> >    You can speed up doc builds by running `make SPHINXDIRS=mm htmldocs`. I
> >    also uploaded a copy of this to my website at
> >    https://ljs.io/output/mm/vma_locks to make it easier to have a quick
> >    read through. Thanks!
> >
> >
> >  Documentation/mm/index.rst     |   1 +
> >  Documentation/mm/vma_locks.rst | 527 +++++++++++++++++++++++++++++++++
> >  2 files changed, 528 insertions(+)
> >  create mode 100644 Documentation/mm/vma_locks.rst
> >
> > diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
> > index 0be1c7503a01..da5f30acaca5 100644
> > --- a/Documentation/mm/index.rst
> > +++ b/Documentation/mm/index.rst
> > @@ -64,3 +64,4 @@ documentation, or deleted if it has served its purpose.
> >     vmemmap_dedup
> >     z3fold
> >     zsmalloc
> > +   vma_locks

This is the "Unsorted Documentation" section.  If the document is really for
the section, I'd suggest putting it in alphabetically sorted order, for the
consistency.  However, if putting the document under the section is not your
real intention, I think it might be better to be put under "Process Addresses"
section above.  What do you think?


Thanks,
SJ

[...]
Jann Horn Nov. 2, 2024, 1:45 a.m. UTC | #4
On Fri, Nov 1, 2024 at 7:50 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
[...]

Thanks for doing this!

> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>
> REVIEWERS NOTES:
>    You can speed up doc builds by running `make SPHINXDIRS=mm htmldocs`. I
>    also uploaded a copy of this to my website at
>    https://ljs.io/output/mm/vma_locks to make it easier to have a quick
>    read through. Thanks!
>
>
>  Documentation/mm/index.rst     |   1 +
>  Documentation/mm/vma_locks.rst | 527 +++++++++++++++++++++++++++++++++
>  2 files changed, 528 insertions(+)
>  create mode 100644 Documentation/mm/vma_locks.rst
>
> diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
> index 0be1c7503a01..da5f30acaca5 100644
> --- a/Documentation/mm/index.rst
> +++ b/Documentation/mm/index.rst
> @@ -64,3 +64,4 @@ documentation, or deleted if it has served its purpose.
>     vmemmap_dedup
>     z3fold
>     zsmalloc
> +   vma_locks
> diff --git a/Documentation/mm/vma_locks.rst b/Documentation/mm/vma_locks.rst
> new file mode 100644
> index 000000000000..52b9d484376a
> --- /dev/null
> +++ b/Documentation/mm/vma_locks.rst
> @@ -0,0 +1,527 @@
> +VMA Locking
> +===========
> +
> +Overview
> +--------
> +
> +Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
> +'VMA's of type `struct vm_area_struct`.
> +
> +Each VMA describes a virtually contiguous memory range with identical
> +attributes, each of which described by a `struct vm_area_struct`
> +object. Userland access outside of VMAs is invalid except in the case where an
> +adjacent stack VMA could be extended to contain the accessed address.
> +
> +All VMAs are contained within one and only one virtual address space, described
> +by a `struct mm_struct` object which is referenced by all tasks (that is,
> +threads) which share the virtual address space. We refer to this as the `mm`.
> +
> +Each mm object contains a maple tree data structure which describes all VMAs
> +within the virtual address space.

The gate VMA is special, on architectures that have it: Userland
access to its area is allowed, but the area is outside the VA range
managed by the normal MM code, and the gate VMA is a global object
(not per-MM), and only a few places in MM code can interact with it
(for example, page fault handling can't, but GUP can through
get_gate_page()).

(I think this also has the fun consequence that vm_normal_page() can
get called on a VMA whose ->vm_mm is NULL, when called from
get_gate_page().)
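
(For illustration only - a small userspace sketch that just greps our own
/proc/self/maps for the gate VMA; on x86-64 it shows up as [vsyscall], with
the exact permissions depending on the vsyscall= boot mode:)

#include <stdio.h>
#include <string.h>
int main(void) {
  char line[256];
  FILE *f = fopen("/proc/self/maps", "r");
  if (!f)
    return 1;
  while (fgets(line, sizeof(line), f))
    if (strstr(line, "[vsyscall]"))  /* the gate VMA, not in the mm's VMA tree */
      fputs(line, stdout);
  fclose(f);
  return 0;
}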

> +The kernel is designed to be highly scalable against concurrent access to
> +userland memory, so a complicated set of locks are required to ensure no data
> +races or memory corruption occurs.
> +
> +This document explores this locking in detail.
> +
> +.. note::
> +
> +   There are three different things that a user might want to achieve via
> +   locks - the first of which is **stability**. That is - ensuring that the VMA
> +   won't be freed or modified in any way from underneath us.
> +
> +   All MM and VMA locks ensure stability.
> +
> +   Secondly we have locks which allow **reads** but not writes (and which might
> +   be held concurrent with other CPUs who also hold the read lock).
> +
> +   Finally, we have locks which permit exclusive access to the VMA to allow for
> +   **writes** to the VMA.

Maybe also mention that there are three major paths you can follow to
reach a VMA? You can come through the mm's VMA tree, you can do an
anon page rmap walk that goes page -> anon_vma -> vma, or you can do a
file rmap walk from the address_space. Which is why just holding the
mmap lock and vma lock in write mode is not enough to permit arbitrary
changes to a VMA struct.
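
For example, the file rmap path reaches VMAs roughly like this (a sketch
using the current interval tree helpers; mapping, pgoff_start and pgoff_end
are assumed to be in scope):

  struct vm_area_struct *vma;

  i_mmap_lock_read(mapping);
  vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff_start, pgoff_end) {
          /* the VMA is reached without ever touching mm->mmap_lock, so
           * holding the mmap/VMA locks alone does not exclude this walker */
  }
  i_mmap_unlock_read(mapping);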

> +MM and VMA locks
> +----------------
> +
> +There are two key classes of lock utilised when reading and manipulating VMAs -
> +the `mmap_lock` which is a read/write semaphore maintained at the `mm_struct`
> +level of granularity and, if CONFIG_PER_VMA_LOCK is set, a per-VMA lock at the
> +VMA level of granularity.
> +
> +.. note::
> +
> +   Generally speaking, a read/write semaphore is a class of lock which permits
> +   concurrent readers. However a write lock can only be obtained once all
> +   readers have left the critical region (and pending readers made to wait).
> +
> +   This renders read locks on a read/write semaphore concurrent with other
> +   readers and write locks exclusive against all others holding the semaphore.
> +
> +If CONFIG_PER_VMA_LOCK is not set, then things are relatively simple - a write
> +mmap lock gives you exclusive write access to a VMA, and a read lock gives you
> +concurrent read-only access.
> +
> +In the presence of CONFIG_PER_VMA_LOCK, i.e. VMA locks, things are more
> +complicated. In this instance, a write semaphore is no longer enough to gain
> +exclusive access to a VMA, a VMA write lock is also required.
> +
> +The VMA lock is implemented via the use of both a read/write semaphore and
> +per-VMA and per-mm sequence numbers. We go into detail on this in the VMA lock
> +internals section below, so for the time being it is important only to note that
> +we can obtain either a VMA read or write lock.
> +
> +.. note::
> +
> +   VMAs under VMA **read** lock are obtained by the `lock_vma_under_rcu()`
> +   function, and **no** existing mmap or VMA lock must be held, This function

uffd_move_lock() calls lock_vma_under_rcu() after having already
VMA-locked another VMA with uffd_lock_vma().

> +   either returns a read-locked VMA, or NULL if the lock could not be
> +   acquired. As the name suggests, the VMA will be acquired under RCU, though
> +   once obtained, remains stable.
> +   This kind of locking is entirely optimistic - if the lock is contended or a
> +   competing write has started, then we do not obtain a read lock.
> +
> +   The `lock_vma_under_rcu()` function first calls `rcu_read_lock()` to ensure
> +   that the VMA is acquired in an RCU critical section, then attempts to VMA
> +   lock it via `vma_start_read()`, before releasing the RCU lock via
> +   `rcu_read_unlock()`.
> +
> +   VMA read locks hold the a read lock on the `vma->vm_lock` semaphore for their

nit: s/ the a / a /

> +   duration and the caller of `lock_vma_under_rcu()` must release it via
> +   `vma_end_read()`.
> +
> +   VMA **write** locks are acquired via `vma_start_write()` in instances where a
> +   VMA is about to be modified, unlike `vma_start_read()` the lock is always
> +   acquired. An mmap write lock **must** be held for the duration of the VMA
> +   write lock, releasing or downgrading the mmap write lock also releases the
> +   VMA write lock so there is no `vma_end_write()` function.
> +
> +   Note that a semaphore write lock is not held across a VMA lock. Rather, a
> +   sequence number is used for serialisation, and the write semaphore is only
> +   acquired at the point of write lock to update this (we explore this in detail
> +   in the VMA lock internals section below).
> +
> +   This ensures the semantics we require - VMA write locks provide exclusive
> +   write access to the VMA.
> +
> +Examining all valid lock state and what each implies:
> +
> +.. list-table::
> +   :header-rows: 1
> +
> +   * - mmap lock
> +     - VMA lock
> +     - Stable?
> +     - Can read safely?
> +     - Can write safely?
> +   * - \-
> +     - \-
> +     - N
> +     - N
> +     - N
> +   * - R
> +     - \-
> +     - Y
> +     - Y
> +     - N
> +   * - \-
> +     - R
> +     - Y
> +     - Y
> +     - N
> +   * - W
> +     - \-
> +     - Y
> +     - Y
> +     - N
> +   * - W
> +     - W
> +     - Y
> +     - Y
> +     - Y
> +
> +Note that there are some exceptions to this - the `anon_vma` field is permitted
> +to be written to under mmap read lock and is instead serialised by the `struct
> +mm_struct` field `page_table_lock`. In addition the `vm_mm` and all

Hm, we really ought to add some smp_store_release() and READ_ONCE(),
or something along those lines, around our ->anon_vma accesses...
especially the "vma->anon_vma = anon_vma" assignment in
__anon_vma_prepare(): it looks to me like, on architectures like arm64
that allow write-write reordering, we could theoretically end up making
a new anon_vma pointer visible to a concurrent page fault before the
anon_vma has been initialized. Though I have no idea whether that is
practically possible - stuff would have to be reordered quite a bit for
that to happen...
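
A minimal sketch of that kind of pairing (illustrative only, not actual
kernel code):

  /* writer, e.g. in __anon_vma_prepare(): publish the pointer only
   * after the anon_vma is fully initialised */
  smp_store_release(&vma->anon_vma, anon_vma);

  /* reader, e.g. in the fault path: the address dependency then
   * guarantees the initialised contents are visible */
  struct anon_vma *av = READ_ONCE(vma->anon_vma);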

> +lock-specific fields are permitted to be read under RCU alone  (though stability cannot
> +be expected in this instance).
> +
> +.. note::
> +   The most notable place to use the VMA read lock is on page table faults on

s/page table faults/page faults/?

> +   the x86-64 architecture, which importantly means that without a VMA write

it's wired up to a bunch of architectures at this point - arm, arm64,
powerpc, riscv, s390, x86 all use lock_vma_under_rcu().

> +   lock, page faults can race against you even if you hold an mmap write lock.
> +
> +VMA Fields
> +----------
> +
> +We examine each field of the `struct vm_area_struct` type in detail in the table
> +below.
> +
> +Reading of each field requires either an mmap read lock or a VMA read lock to be
> +held, except where 'unstable RCU read' is specified, in which case unstable
> +access to the field is permitted under RCU alone.
> +
> +The table specifies which write locks must be held to write to the field.

vm_start, vm_end and vm_pgoff also require that the associated
address_space and anon_vma (if applicable) are write-locked, and that
their rbtrees are updated as needed.

> +.. list-table::
> +   :widths: 20 10 22 5 20
> +   :header-rows: 1
> +
> +   * - Field
> +     - Config
> +     - Description
> +     - Unstable RCU read?
> +     - Write Lock
> +   * - vm_start
> +     -
> +     - Inclusive start virtual address of range VMA describes.
> +     -
> +     - mmap write, VMA write
> +   * - vm_end
> +     -
> +     - Exclusive end virtual address of range VMA describes.
> +     -
> +     - mmap write, VMA write
> +   * - vm_rcu
> +     - vma lock
> +     - RCU list head, in union with vma_start, vma_end. RCU implementation detail.
> +     - N/A
> +     - N/A
> +   * - vm_mm
> +     -
> +     - Containing mm_struct.
> +     - Y
> +     - (Static)
> +   * - vm_page_prot
> +     -
> +     - Architecture-specific page table protection bits determined from VMA
> +       flags
> +     -
> +     - mmap write, VMA write
> +   * - vm_flags
> +     -
> +     - Read-only access to VMA flags describing attributes of VMA, in union with
> +       private writable `__vm_flags`.
> +     -
> +     - N/A
> +   * - __vm_flags
> +     -
> +     - Private, writable access to VMA flags field, updated by vm_flags_*()
> +       functions.
> +     -
> +     - mmap write, VMA write
> +   * - detached
> +     - vma lock
> +     - VMA lock implementation detail - indicates whether the VMA has been
> +       detached from the tree.
> +     - Y
> +     - mmap write, VMA write
> +   * - vm_lock_seq
> +     - vma lock
> +     - VMA lock implementation detail - A sequence number used to serialise the
> +       VMA lock, see the VMA lock section below.
> +     - Y
> +     - mmap write, VMA write

I think "mmap write" is accurate, but "VMA write" is inaccurate -
you'd need to have already written to the vm_lock_seq in order to have
a VMA write lock.
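
For context, the mechanism the document describes boils down to roughly
this (a paraphrased sketch, not verbatim kernel code):

  static void vma_start_write_sketch(struct vm_area_struct *vma)
  {
          int mm_lock_seq = vma->vm_mm->mm_lock_seq;

          /* already write-locked within this mmap write lock cycle */
          if (vma->vm_lock_seq == mm_lock_seq)
                  return;

          /* wait for any VMA read lock holders to drain... */
          down_write(&vma->vm_lock->lock);
          /* ...then mark the VMA write-locked and drop the semaphore */
          WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
          up_write(&vma->vm_lock->lock);
  }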

> +   * - vm_lock
> +     - vma lock
> +     - VMA lock implementation detail - A pointer to the VMA lock read/write
> +       semaphore.
> +     - Y
> +     - None required
> +   * - shared.rb
> +     -
> +     - A red/black tree node used, if the mapping is file-backed, to place the
> +       VMA in the `struct address_space->i_mmap` red/black interval tree.
> +     -
> +     - mmap write, VMA write, i_mmap write
> +   * - shared.rb_subtree_last
> +     -
> +     - Metadata used for management of the interval tree if the VMA is
> +       file-backed.
> +     -
> +     - mmap write, VMA write, i_mmap write
> +   * - anon_vma_chain
> +     -
> +     - List of links to forked/CoW'd `anon_vma` objects.
> +     -
> +     - mmap read or above, anon_vma write lock
> +   * - anon_vma
> +     -
> +     - `anon_vma` object used by anonymous folios mapped exclusively to this VMA.
> +     -
> +     - mmap read or above, page_table_lock
> +   * - vm_ops
> +     -
> +     - If the VMA is file-backed, then either the driver or file-system provides
> +       a `struct vm_operations_struct` object describing callbacks to be invoked
> +       on specific VMA lifetime events.
> +     -
> +     - (Static)
> +   * - vm_pgoff
> +     -
> +     - Describes the page offset into the file, the original page offset within
> +       the virtual address space (prior to any `mremap()`), or PFN if a PFN map.

Ooh, right, I had forgotten about this quirk, and I think I never
fully understood these rules... it's a PFN if the VMA is
private+maywrite+pfnmap. And the vma->vm_pgoff is set in
remap_pfn_range_internal() under those conditions.

Huh, so for example, if you are in an environment where usbdev_mmap()
uses remap_pfn_range() (which depends on hardware - it seems to work
inside QEMU but not on a real machine), and you have at least read
access to a device at /dev/bus/usb/*/* (which are normally
world-readable), you can actually do this:

user@vm:/tmp$ cat usb-get-physaddr.c
#include <err.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/mman.h>
#define SYSCHK(x) ({          \
  typeof(x) __res = (x);      \
  if (__res == (typeof(x))-1) \
    err(1, "SYSCHK(" #x ")"); \
  __res;                      \
})
int main(int argc, char **argv) {
  if (argc != 2)
    errx(1, "expect one argument (usbdev path)");
  int fd = SYSCHK(open(argv[1], O_RDONLY));
  SYSCHK(mmap((void*)0x10000, 0x1000, PROT_READ|PROT_WRITE,
              MAP_PRIVATE|MAP_FIXED_NOREPLACE, fd, 0));
  system("head -n1 /proc/$PPID/maps");
}
user@vm:/tmp$ gcc -o usb-get-physaddr usb-get-physaddr.c
user@vm:/tmp$ ./usb-get-physaddr /dev/bus/usb/001/001
00010000-00011000 rw-p 0103f000 00:06 135   /dev/bus/usb/001/001
user@vm:/tmp$ ./usb-get-physaddr /dev/bus/usb/001/001
00010000-00011000 rw-p 0103f000 00:06 135   /dev/bus/usb/001/001
user@vm:/tmp$ ./usb-get-physaddr /dev/bus/usb/001/001
00010000-00011000 rw-p 0107e000 00:06 135   /dev/bus/usb/001/001
user@vm:/tmp$ ./usb-get-physaddr /dev/bus/usb/001/001
00010000-00011000 rw-p 010bd000 00:06 135   /dev/bus/usb/001/001
user@vm:/tmp$

and see physical addresses in the offset field in /proc/*/maps...
that's not great. And pointless on architectures with
CONFIG_ARCH_HAS_PTE_SPECIAL, from what I can tell.


> +     -
> +     - mmap write, VMA write
> +   * - vm_file
> +     -
> +     - If the VMA is file-backed, points to a `struct file` object describing
> +       the underlying file, if anonymous then `NULL`.
> +     -
> +     - (Static)
> +   * - vm_private_data
> +     -
> +     - A `void *` field for driver-specific metadata.
> +     -
> +     - Driver-mandated.
> +   * - anon_name
> +     - anon name
> +     - A field for storing a `struct anon_vma_name` object providing a name for
> +       anonymous mappings, or `NULL` if none is set or the VMA is file-backed.
> +     -
> +     - mmap write, VMA write
> +   * - swap_readahead_info
> +     - swap
> +     - Metadata used by the swap mechanism to perform readahead.
> +     -
> +     - mmap read
> +   * - vm_region
> +     - nommu
> +     - The containing region for the VMA for architectures which do not
> +       possess an MMU.
> +     - N/A
> +     - N/A
> +   * - vm_policy
> +     - numa
> +     - `mempolicy` object which describes NUMA behaviour of the VMA.
> +     -
> +     - mmap write, VMA write
> +   * - numab_state
> +     - numab
> +     - `vma_numab_state` object which describes the current state of NUMA
> +       balancing in relation to this VMA.
> +     -
> +     - mmap write, VMA write

I think task_numa_work() is only holding the mmap lock in read mode
when it sets this pointer to a non-NULL value.

> +   * - vm_userfaultfd_ctx
> +     -
> +     - Userfaultfd context wrapper object of type `vm_userfaultfd_ctx`, either
> +       of zero size if userfaultfd is disabled, or containing a pointer to an
> +       underlying `userfaultfd_ctx` object which describes userfaultfd metadata.
> +     -
> +     - mmap write, VMA write
> +
> +.. note::
> +
> +   In the config column 'vma lock' configuration means CONFIG_PER_VMA_LOCK,
> +   'anon name' means CONFIG_ANON_VMA_NAME, 'swap' means CONFIG_SWAP, 'nommu'
> +   means that CONFIG_MMU is not set, 'numa' means CONFIG_NUMA and 'numab' means
> +   CONFIG_NUMA_BALANCING'.
> +
> +   In the write lock column '(Static)' means that the field is set only once
> +   upon initialisation of the VMA and not changed after this, the VMA would
> +   either have been under an mmap write and VMA write lock at the time or not
> +   yet inserted into any tree.
> +
> +Page table locks
> +----------------
> +
> +When allocating a P4D, PUD or PMD and setting the relevant entry in the above
> +PGD, P4D or PUD, the `mm->page_table_lock` is acquired to do so. This is
> +acquired in `__p4d_alloc()`, `__pud_alloc()` and `__pmd_alloc()` respectively.
> +
> +.. note::
> +   `__pmd_alloc()` actually invokes `pud_lock()` and `pud_lockptr()` in turn,
> +   however at the time of writing it ultimately references the
> +   `mm->page_table_lock`.
> +
> +Allocating a PTE will either use the `mm->page_table_lock` or, if
> +`USE_SPLIT_PMD_PTLOCKS` is defined, used a lock embedded in the PMD physical
> +page metadata in the form of a `struct ptdesc`, acquired by `pmd_ptdesc()`
> +called from `pmd_lock()` and ultimately `__pte_alloc()`.
>+
> +Finally, modifying the contents of the PTE has special treatment, as this is a
> +lock that we must acquire whenever we want stable and exclusive access to
> +entries pointing to data pages within a PTE, especially when we wish to modify
> +them.

I guess one other perspective on this would be to focus on the
circumstances under which you're allowed to write entries:

0. page tables can be concurrently read by hardware and GUP-fast, so
   writes must always be appropriately atomic
1. changing a page table entry always requires locking the containing
   page table (except when the write is an A/D update by hardware)
2. in page tables higher than PMD level, page table entries that point
   to page tables can only be changed to point to something else when
   holding all the relevant high-level locks leading to the VMA in
   exclusive mode: mmap lock (unless the VMA is detached), VMA lock,
   anon_vma, address_space
3. PMD entries that point to page tables can be changed while holding
   the page table spinlocks for the entry and the table it points to
4. lowest-level page tables can be in high memory, so they must be
   kmapped for access, and pte_offset_map_lock() does that for you
5. entries in "none" state can only be populated with leaf entries
   while holding the mmap or vma lock (doing it through the rmap would
   be bad because that could race with munmap() zapping data pages in
   the region)
6. leaf entries can be zapped (changed to "none") while holding any
   one of mmap lock, vma lock, address_space lock, or anon_vma lock

And then the rules for readers mostly follow from that:
1 => holding the appropriate page table lock makes the contents of a
     page table stable, except for A/D updates
2 => page table entries higher than PMD level that point to lower page
     tables can be followed without taking page table locks
3+4 => following PMD entries pointing to page tables requires careful
       locking, and pte_offset_map_lock() does that for you

Ah, though now I see the page table teardown section below already has
some of this information.
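
(For reference, the usual leaf-level pattern those rules boil down to -
a sketch assuming the pte_offset_map_lock()/pte_unmap_unlock() API, with
mm, pmd and addr assumed to be in scope:)

  pte_t *pte;
  spinlock_t *ptl;

  pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
  if (!pte)
          return;  /* the page table was removed or replaced; retry or bail */

  /* *pte is now stable (modulo hardware A/D updates): read or modify it */

  pte_unmap_unlock(pte, ptl);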

> +This is performed via `pte_offset_map_lock()` which carefully checks to ensure
> +that the PTE hasn't changed from under us, ultimately invoking `pte_lockptr()`
> +to obtain a spin lock at PTE granularity contained within the `struct ptdesc`
> +associated with the physical PTE page. The lock must be released via
> +`pte_unmap_unlock()`.

Sidenote: Not your fault that the Linux terminology for this sucks,
but the way this section uses "PTE" to describe a page table rather
than a Page Table Entry is pretty confusing to me... in my head, a
pte_t is a Page Table Entry (PTE), a pte_t* is a Page Table or a Page
Table Entry Pointer (depending on context), a pmd_t is a Page Middle
Directory Entry, and a pmd_t* is a Page Middle Directory or a Page
Middle Directory Entry Pointer. (Though to make things easier I
normally think of them as L1 entry, L1 table, L2 entry, L2 table.)

> +.. note::
> +   There are some variants on this, such as `pte_offset_map_rw_nolock()` when we
> +   know we hold the PTE stable but for brevity we do not explore this.
> +   See the comment for `__pte_offset_map_lock()` for more details.
> +
> +When modifying data in ranges we typically only wish to allocate higher page
> +tables as necessary, using these locks to avoid races or overwriting anything,
> +and set/clear data at the PTE level as required (for instance when page faulting
> +or zapping).
> +
> +Page table teardown
> +-------------------
> +
> +Tearing down page tables themselves is something that requires significant
> +care. There must be no way that page tables designated for removal can be
> +traversed or referenced by concurrent tasks.

(except by hardware or with gup_fast() which behaves roughly like a
hardware page walker and completely ignores what is happening at the
VMA layer)

> +It is insufficient to simply hold an mmap write lock and VMA lock (which will
> +prevent racing faults, and rmap operations), as a file-backed mapping can be
> +truncated under the `struct address_space` i_mmap_lock alone.
> +
> +As a result, no VMA which can be accessed via the reverse mapping (either
> +anon_vma or the `struct address_space->i_mmap` interval tree) can have its page
> +tables torn down.

(except last-level page tables: khugepaged already deletes those for
file mappings without using the mmap lock at all in
retract_page_tables(), and there is a pending series that will do the
same with page tables in other VMAs too, see
<https://lore.kernel.org/all/cover.1729157502.git.zhengqi.arch@bytedance.com/>)

> +The operation is typically performed via `free_pgtables()`, which assumes either
> +the mmap write lock has been taken (as specified by its `mm_wr_locked`
> +parameter), or that it the VMA is fully detached.

nit: s/that it the/that the/

> +It carefully removes the VMA from all reverse mappings, however it's important
> +that no new ones overlap these or any route remain to permit access to addresses
> +within the range whose page tables are being torn down.
> +
> +As a result of these careful conditions, note that page table entries are
> +cleared without page table locks, as it is assumed that all of these precautions
> +have already been taken.

Oh, I didn't realize this... interesting.

> +mmap write lock downgrading
> +---------------------------
> +
> +While it is possible to obtain an mmap write or read lock using the
> +`mm->mmap_lock` read/write semaphore, it is also possible to **downgrade** from
> +a write lock to a read lock via `mmap_write_downgrade()`.
> +
> +Similar to `mmap_write_unlock()`, this implicitly terminates all VMA write locks
> +via `vma_end_write_all()` (more or this behaviour in the VMA lock internals

typo: s/or/on/

> +section below), but importantly does not relinquish the mmap lock while
> +downgrading, therefore keeping the locked virtual address space stable.
> +
> +A subtlety here is that callers can assume, if they invoke an
> +mmap_write_downgrade() operation, that they still have exclusive access to the
> +virtual address space (excluding VMA read lock holders), as for another task to
> +have downgraded they would have had to have exclusive access to the semaphore
> +which can't be the case until the current task completes what it is doing.
> +
> +Stack expansion
> +---------------
> +
> +Stack expansion throws up additional complexities in that we cannot permit there
> +to be racing page faults, as a result we invoke `vma_start_write()` to prevent
> +this in `expand_downwards()` or `expand_upwards()`.

And this needs the mmap lock in write mode, so stack expansion is only
done in codepaths where we can reliably get that - so it happens on
fault handling, but not on GUP. This probably creates the fun quirk
that, in theory, the following scenario could happen:

1. a userspace program creates a large on-stack buffer (which exceeds
   the bounds of the current stack VMA but is within the stack size
   limit)
2. userspace calls something like the read() syscall on this buffer
   (without writing to any deeper part of the stack - so this can't
   happen when you call into a non-inlined library function for read()
   on x86, but it might happen on arm64, where a function call does not
   require writing to the stack)
3. the kernel read() handler is trying to do something like direct I/O
   and uses GUP to pin the user-supplied pages (and does not use
   copy_to_user(), which would be more common)
4. GUP fails, the read() fails

But this was probably the least bad option to deal with existing stack
expansion issues.

> +Lock ordering
> +-------------
> +
> +As we have multiple locks across the kernel which may or may not be taken at the
> +same time as explicit mm or VMA locks, we have to be wary of lock inversion, and
> +the **order** in which locks are acquired and released becomes very important.
> +
> +.. note::
> +
> +   Lock inversion occurs when two threads need to acquire multiple locks,
> +   but in doing so inadvertently cause a mutual deadlock.
> +
> +   For example, consider thread 1 which holds lock A and tries to acquire lock B,
> +   while thread 2 holds lock B and tries to acquire lock A.
> +
> +   Both threads are now deadlocked on each other. However, had they attempted to
> +   acquire locks in the same order, one would have waited for the other to
> +   complete its work and no deadlock would have occurred.
> +
> +The opening comment in `mm/rmap.c` describes in detail the required ordering of
> +locks within memory management code:
> +
> +.. code-block::
> +
> +  inode->i_rwsem       (while writing or truncating, not reading or faulting)
> +    mm->mmap_lock
> +      mapping->invalidate_lock (in filemap_fault)
> +        folio_lock
> +          hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
> +            vma_start_write
> +              mapping->i_mmap_rwsem
> +                anon_vma->rwsem
> +                  mm->page_table_lock or pte_lock
> +                    swap_lock (in swap_duplicate, swap_info_get)
> +                      mmlist_lock (in mmput, drain_mmlist and others)
> +                      mapping->private_lock (in block_dirty_folio)
> +                          i_pages lock (widely used)
> +                            lruvec->lru_lock (in folio_lruvec_lock_irq)
> +                      inode->i_lock (in set_page_dirty's __mark_inode_dirty)
> +                      bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
> +                        sb_lock (within inode_lock in fs/fs-writeback.c)
> +                        i_pages lock (widely used, in set_page_dirty,
> +                                  in arch-dependent flush_dcache_mmap_lock,
> +                                  within bdi.wb->list_lock in __sync_single_inode)
> +
> +Please check the current state of this comment which may have changed since the
> +time of writing of this document.

I think something like
https://www.kernel.org/doc/html/latest/doc-guide/kernel-doc.html#overview-documentation-comments
is supposed to let you include the current version of the comment into
the rendered documentation HTML without having to manually keep things
in sync. I've never used that myself, but there are a bunch of
examples in the tree; for example, grep for "DMA fences overview".
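
Roughly, that would look something like this (just a sketch - the "DOC:"
title here is made up):

  /* in mm/rmap.c, turn the existing comment into a kernel-doc block: */
  /**
   * DOC: mm lock ordering
   *
   * inode->i_rwsem (while writing or truncating, not reading or faulting)
   *   mm->mmap_lock
   *     ...
   */

and then pull it into the rendered document with:

  .. kernel-doc:: mm/rmap.c
     :doc: mm lock ordering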
Mike Rapoport Nov. 2, 2024, 9 a.m. UTC | #5
On Fri, Nov 01, 2024 at 06:50:33PM +0000, Lorenzo Stoakes wrote:
> Locking around VMAs is complicated and confusing. While we have a number of
> disparate comments scattered around the place, we seem to be reaching a
> level of complexity that justifies a serious effort at clearly documenting
> how locks are expected to be interacted with when it comes to interacting
> with mm_struct and vm_area_struct objects.
> 
> This is especially pertinent as regards efforts to find sensible
> abstractions for these fundamental objects within the kernel rust
> abstraction whose compiler strictly requires some means of expressing these
> rules (and through this expression can help self-document these
> requirements as well as enforce them which is an exciting concept).
> 
> The document limits scope to mmap and VMA locks and those that are
> immediately adjacent and relevant to them - so additionally covers page
> table locking as this is so very closely tied to VMA operations (and relies
> upon us handling these correctly).
> 
> The document tries to cover some of the nastier and more confusing edge
> cases and concerns especially around lock ordering and page table teardown.
> 
> The document also provides some VMA lock internals, which are up to date
> and inclusive of recent changes to recent sequence number changes.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> 
> REVIEWERS NOTES:
>    You can speed up doc builds by running `make SPHINXDIRS=mm htmldocs`. I
>    also uploaded a copy of this to my website at
>    https://ljs.io/output/mm/vma_locks to make it easier to have a quick
>    read through. Thanks!
> 
> 
>  Documentation/mm/index.rst     |   1 +
>  Documentation/mm/vma_locks.rst | 527 +++++++++++++++++++++++++++++++++
>  2 files changed, 528 insertions(+)
>  create mode 100644 Documentation/mm/vma_locks.rst
> 
> diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
> index 0be1c7503a01..da5f30acaca5 100644
> --- a/Documentation/mm/index.rst
> +++ b/Documentation/mm/index.rst
> @@ -64,3 +64,4 @@ documentation, or deleted if it has served its purpose.
>     vmemmap_dedup
>     z3fold
>     zsmalloc
> +   vma_locks

Please keep the TOC sorted alphabetically.

> diff --git a/Documentation/mm/vma_locks.rst b/Documentation/mm/vma_locks.rst
> new file mode 100644
> index 000000000000..52b9d484376a
> --- /dev/null
> +++ b/Documentation/mm/vma_locks.rst
> @@ -0,0 +1,527 @@
> +VMA Locking
> +===========
> +
> +Overview
> +--------
> +
> +Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
> +'VMA's of type `struct vm_area_struct`.
> +
> +Each VMA describes a virtually contiguous memory range with identical
> +attributes, each of which described by a `struct vm_area_struct`
> +object. Userland access outside of VMAs is invalid except in the case where an
> +adjacent stack VMA could be extended to contain the accessed address.
> +
> +All VMAs are contained within one and only one virtual address space, described
> +by a `struct mm_struct` object which is referenced by all tasks (that is,
> +threads) which share the virtual address space. We refer to this as the `mm`.
> +
> +Each mm object contains a maple tree data structure which describes all VMAs
> +within the virtual address space.
> +
> +The kernel is designed to be highly scalable against concurrent access to
> +userland memory,

"and concurrent changes to the virtual address space layoyt"?

> so a complicated set of locks are required to ensure no data
> +races or memory corruption occurs.
> +
> +This document explores this locking in detail.
> +
> +.. note::
> +
> +   There are three different things that a user might want to achieve via
> +   locks - the first of which is **stability**. That is - ensuring that the VMA
> +   won't be freed or modified in any way from underneath us.
> +
> +   All MM and VMA locks ensure stability.
> +
> +   Secondly we have locks which allow **reads** but not writes (and which might
> +   be held concurrent with other CPUs who also hold the read lock).

I think it should be clarified here that *reads* are from data structures
rather than user memory.

> +
> +   Finally, we have locks which permit exclusive access to the VMA to allow for

                                                                      ^ object
> +   **writes** to the VMA.
> +
> +MM and VMA locks
> +----------------
> +
> +There are two key classes of lock utilised when reading and manipulating VMAs -
> +the `mmap_lock` which is a read/write semaphore maintained at the `mm_struct`
> +level of granularity and, if CONFIG_PER_VMA_LOCK is set, a per-VMA lock at the
> +VMA level of granularity.
> +
> +.. note::
> +
> +   Generally speaking, a read/write semaphore is a class of lock which permits
> +   concurrent readers. However a write lock can only be obtained once all
> +   readers have left the critical region (and pending readers made to wait).
> +
> +   This renders read locks on a read/write semaphore concurrent with other
> +   readers and write locks exclusive against all others holding the semaphore.
> +
> +If CONFIG_PER_VMA_LOCK is not set, then things are relatively simple - a write
> +mmap lock gives you exclusive write access to a VMA, and a read lock gives you
> +concurrent read-only access.
> +
> +In the presence of CONFIG_PER_VMA_LOCK, i.e. VMA locks, things are more
> +complicated. In this instance, a write semaphore is no longer enough to gain
> +exclusive access to a VMA, a VMA write lock is also required.
> +
> +The VMA lock is implemented via the use of both a read/write semaphore and
> +per-VMA and per-mm sequence numbers. We go into detail on this in the VMA lock
> +internals section below, so for the time being it is important only to note that
> +we can obtain either a VMA read or write lock.
> +
> +.. note::

I don't think the below text should be a "note", I'd just keep it as a
continuation of the section.

> +
> +   VMAs under VMA **read** lock are obtained by the `lock_vma_under_rcu()`
> +   function, and **no** existing mmap or VMA lock must be held, This function
> +   either returns a read-locked VMA, or NULL if the lock could not be
> +   acquired. As the name suggests, the VMA will be acquired under RCU, though
> +   once obtained, remains stable.
> +
> +   This kind of locking is entirely optimistic - if the lock is contended or a
> +   competing write has started, then we do not obtain a read lock.
> +
> +   The `lock_vma_under_rcu()` function first calls `rcu_read_lock()` to ensure
> +   that the VMA is acquired in an RCU critical section, then attempts to VMA
> +   lock it via `vma_start_read()`, before releasing the RCU lock via
> +   `rcu_read_unlock()`.
> +
> +   VMA read locks hold the a read lock on the `vma->vm_lock` semaphore for their

                         ^ no idea if it should be 'a' or 'the', but surely
not both :)

> +   duration and the caller of `lock_vma_under_rcu()` must release it via
> +   `vma_end_read()`.
> +
> +   VMA **write** locks are acquired via `vma_start_write()` in instances where a
> +   VMA is about to be modified, unlike `vma_start_read()` the lock is always
> +   acquired. An mmap write lock **must** be held for the duration of the VMA
> +   write lock, releasing or downgrading the mmap write lock also releases the
> +   VMA write lock so there is no `vma_end_write()` function.
> +
> +   Note that a semaphore write lock is not held across a VMA lock. Rather, a
> +   sequence number is used for serialisation, and the write semaphore is only
> +   acquired at the point of write lock to update this (we explore this in detail
> +   in the VMA lock internals section below).
> +
> +   This ensures the semantics we require - VMA write locks provide exclusive
> +   write access to the VMA.
> +
> +Examining all valid lock state and what each implies:
> +
> +.. list-table::
> +   :header-rows: 1

Can we make it just a .. table::?

.. table::

    ========= ======== ======= ================ =================
    mmap lock VMA lock Stable? Can read safely? Can write safely?
    ========= ======== ======= ================ =================
    \-        \-       N       N                N
    R         \-       Y       Y                N
    \-        R        Y       Y                N
    W         \-       Y       Y                N
    W         W        Y       Y                Y
    ========= ======== ======= ================ =================

> +
> +   * - mmap lock
> +     - VMA lock
> +     - Stable?
> +     - Can read safely?
> +     - Can write safely?
> +   * - \-
> +     - \-
> +     - N
> +     - N
> +     - N
> +   * - R
> +     - \-
> +     - Y
> +     - Y
> +     - N
> +   * - \-
> +     - R
> +     - Y
> +     - Y
> +     - N
> +   * - W
> +     - \-
> +     - Y
> +     - Y
> +     - N
> +   * - W
> +     - W
> +     - Y
> +     - Y
> +     - Y
> +
> +Note that there are some exceptions to this - the `anon_vma` field is permitted
> +to be written to under mmap read lock and is instead serialised by the `struct
> +mm_struct` field `page_table_lock`. In addition the `vm_mm` and all
> +lock-specific fields are permitted to be read under RCU alone  (though stability cannot
> +be expected in this instance).
> +
> +.. note::
> +   The most notable place to use the VMA read lock is on page table faults on
> +   the x86-64 architecture, which importantly means that without a VMA write
> +   lock, page faults can race against you even if you hold an mmap write lock.
> +
> +VMA Fields
> +----------
> +
> +We examine each field of the `struct vm_area_struct` type in detail in the table
> +below.
> +
> +Reading of each field requires either an mmap read lock or a VMA read lock to be
> +held, except where 'unstable RCU read' is specified, in which case unstable
> +access to the field is permitted under RCU alone.
> +
> +The table specifies which write locks must be held to write to the field.
> +
> +.. list-table::
> +   :widths: 20 10 22 5 20
> +   :header-rows: 1

And use .. table:: here as well, e.g.:

.. table::

    ======== ======== ========================== ================== ==========
    Field    Config   Description                Unstable RCU read? Write lock
    ======== ======== ========================== ================== ==========
    vm_start          Inclusive start virtual                       mmap write,
                      address of range VMA                          VMA write
                      describes

    vm_end            Exclusive end virtual                         mmap write,
                      address of range VMA                          VMA write
                      describes

    vm_rcu   vma_lock RCU list head, in union    N/A                N/A
                      with vma_start, vma_end.
                      RCU implementation detail
    ======== ======== ========================== ================== ==========


> +
> +   * - Field
> +     - Config
> +     - Description
> +     - Unstable RCU read?
> +     - Write Lock
> +   * - vm_start
> +     -
> +     - Inclusive start virtual address of range VMA describes.
> +     -
> +     - mmap write, VMA write
> +   * - vm_end
> +     -
> +     - Exclusive end virtual address of range VMA describes.
> +     -
> +     - mmap write, VMA write
> +   * - vm_rcu
> +     - vma lock
> +     - RCU list head, in union with vma_start, vma_end. RCU implementation detail.
> +     - N/A
> +     - N/A
> +   * - vm_mm
> +     -
> +     - Containing mm_struct.
> +     - Y
> +     - (Static)
> +   * - vm_page_prot
> +     -
> +     - Architecture-specific page table protection bits determined from VMA
> +       flags
> +     -
> +     - mmap write, VMA write
> +   * - vm_flags
> +     -
> +     - Read-only access to VMA flags describing attributes of VMA, in union with
> +       private writable `__vm_flags`.
> +     -
> +     - N/A
> +   * - __vm_flags
> +     -
> +     - Private, writable access to VMA flags field, updated by vm_flags_*()
> +       functions.
> +     -
> +     - mmap write, VMA write
> +   * - detached
> +     - vma lock
> +     - VMA lock implementation detail - indicates whether the VMA has been
> +       detached from the tree.
> +     - Y
> +     - mmap write, VMA write
> +   * - vm_lock_seq
> +     - vma lock
> +     - VMA lock implementation detail - A sequence number used to serialise the
> +       VMA lock, see the VMA lock section below.
> +     - Y
> +     - mmap write, VMA write
> +   * - vm_lock
> +     - vma lock
> +     - VMA lock implementation detail - A pointer to the VMA lock read/write
> +       semaphore.
> +     - Y
> +     - None required
> +   * - shared.rb
> +     -
> +     - A red/black tree node used, if the mapping is file-backed, to place the
> +       VMA in the `struct address_space->i_mmap` red/black interval tree.
> +     -
> +     - mmap write, VMA write, i_mmap write
> +   * - shared.rb_subtree_last
> +     -
> +     - Metadata used for management of the interval tree if the VMA is
> +       file-backed.
> +     -
> +     - mmap write, VMA write, i_mmap write
> +   * - anon_vma_chain
> +     -
> +     - List of links to forked/CoW'd `anon_vma` objects.
> +     -
> +     - mmap read or above, anon_vma write lock
> +   * - anon_vma
> +     -
> +     - `anon_vma` object used by anonymous folios mapped exclusively to this VMA.
> +     -
> +     - mmap read or above, page_table_lock
> +   * - vm_ops
> +     -
> +     - If the VMA is file-backed, then either the driver or file-system provides
> +       a `struct vm_operations_struct` object describing callbacks to be invoked
> +       on specific VMA lifetime events.
> +     -
> +     - (Static)
> +   * - vm_pgoff
> +     -
> +     - Describes the page offset into the file, the original page offset within
> +       the virtual address space (prior to any `mremap()`), or PFN if a PFN map.
> +     -
> +     - mmap write, VMA write
> +   * - vm_file
> +     -
> +     - If the VMA is file-backed, points to a `struct file` object describing
> +       the underlying file, if anonymous then `NULL`.
> +     -
> +     - (Static)
> +   * - vm_private_data
> +     -
> +     - A `void *` field for driver-specific metadata.
> +     -
> +     - Driver-mandated.
> +   * - anon_name
> +     - anon name
> +     - A field for storing a `struct anon_vma_name` object providing a name for
> +       anonymous mappings, or `NULL` if none is set or the VMA is file-backed.
> +     -
> +     - mmap write, VMA write
> +   * - swap_readahead_info
> +     - swap
> +     - Metadata used by the swap mechanism to perform readahead.
> +     -
> +     - mmap read
> +   * - vm_region
> +     - nommu
> +     - The containing region for the VMA for architectures which do not
> +       possess an MMU.
> +     - N/A
> +     - N/A
> +   * - vm_policy
> +     - numa
> +     - `mempolicy` object which describes NUMA behaviour of the VMA.
> +     -
> +     - mmap write, VMA write
> +   * - numab_state
> +     - numab
> +     - `vma_numab_state` object which describes the current state of NUMA
> +       balancing in relation to this VMA.
> +     -
> +     - mmap write, VMA write
> +   * - vm_userfaultfd_ctx
> +     -
> +     - Userfaultfd context wrapper object of type `vm_userfaultfd_ctx`, either
> +       of zero size if userfaultfd is disabled, or containing a pointer to an
> +       underlying `userfaultfd_ctx` object which describes userfaultfd metadata.
> +     -
> +     - mmap write, VMA write
> +
> +.. note::
> +
> +   In the config column 'vma lock' configuration means CONFIG_PER_VMA_LOCK,
> +   'anon name' means CONFIG_ANON_VMA_NAME, 'swap' means CONFIG_SWAP, 'nommu'
> +   means that CONFIG_MMU is not set, 'numa' means CONFIG_NUMA and 'numab' means
> +   CONFIG_NUMA_BALANCING'.
> +
> +   In the write lock column '(Static)' means that the field is set only once
> +   upon initialisation of the VMA and not changed after this, the VMA would
> +   either have been under an mmap write and VMA write lock at the time or not
> +   yet inserted into any tree.
> +
> +Page table locks
> +----------------
> +
> +When allocating a P4D, PUD or PMD and setting the relevant entry in the above
> +PGD, P4D or PUD, the `mm->page_table_lock` is acquired to do so. This is
> +acquired in `__p4d_alloc()`, `__pud_alloc()` and `__pmd_alloc()` respectively.
> +
> +.. note::
> +   `__pmd_alloc()` actually invokes `pud_lock()` and `pud_lockptr()` in turn,
> +   however at the time of writing it ultimately references the
> +   `mm->page_table_lock`.
> +
> +Allocating a PTE will either use the `mm->page_table_lock` or, if
> +`USE_SPLIT_PMD_PTLOCKS` is defined, used a lock embedded in the PMD physical
> +page metadata in the form of a `struct ptdesc`, acquired by `pmd_ptdesc()`
> +called from `pmd_lock()` and ultimately `__pte_alloc()`.
> +
> +Finally, modifying the contents of the PTE has special treatment, as this is a
> +lock that we must acquire whenever we want stable and exclusive access to
> +entries pointing to data pages within a PTE, especially when we wish to modify
> +them.
> +
> +This is performed via `pte_offset_map_lock()` which carefully checks to ensure
> +that the PTE hasn't changed from under us, ultimately invoking `pte_lockptr()`
> +to obtain a spin lock at PTE granularity contained within the `struct ptdesc`
> +associated with the physical PTE page. The lock must be released via
> +`pte_unmap_unlock()`.
> +
> +.. note::
> +   There are some variants on this, such as `pte_offset_map_rw_nolock()` when we
> +   know we hold the PTE stable but for brevity we do not explore this.
> +   See the comment for `__pte_offset_map_lock()` for more details.
> +
> +When modifying data in ranges we typically only wish to allocate higher page
> +tables as necessary, using these locks to avoid races or overwriting anything,
> +and set/clear data at the PTE level as required (for instance when page faulting
> +or zapping).
> +
> +Page table teardown
> +-------------------
> +
> +Tearing down page tables themselves is something that requires significant
> +care. There must be no way that page tables designated for removal can be
> +traversed or referenced by concurrent tasks.
> +
> +It is insufficient to simply hold an mmap write lock and VMA lock (which will
> +prevent racing faults, and rmap operations), as a file-backed mapping can be
> +truncated under the `struct address_space` i_mmap_lock alone.
> +
> +As a result, no VMA which can be accessed via the reverse mapping (either
> +anon_vma or the `struct address_space->i_mmap` interval tree) can have its page
> +tables torn down.
> +
> +The operation is typically performed via `free_pgtables()`, which assumes either
> +the mmap write lock has been taken (as specified by its `mm_wr_locked`
> +parameter), or that it the VMA is fully detached.

               "or that the VMA is..." ?

> +It carefully removes the VMA from all reverse mappings, however it's important
> +that no new ones overlap these or any route remain to permit access to addresses
> +within the range whose page tables are being torn down.
> +
> +As a result of these careful conditions, note that page table entries are
> +cleared without page table locks, as it is assumed that all of these precautions
> +have already been taken.
> +
> +mmap write lock downgrading
> +---------------------------
> +
> +While it is possible to obtain an mmap write or read lock using the
> +`mm->mmap_lock` read/write semaphore, it is also possible to **downgrade** from
> +a write lock to a read lock via `mmap_write_downgrade()`.
> +
> +Similar to `mmap_write_unlock()`, this implicitly terminates all VMA write locks
> +via `vma_end_write_all()` (more or this behaviour in the VMA lock internals
> +section below), but importantly does not relinquish the mmap lock while
> +downgrading, therefore keeping the locked virtual address space stable.
> +
> +A subtlety here is that callers can assume, if they invoke an
> +mmap_write_downgrade() operation, that they still have exclusive access to the
> +virtual address space (excluding VMA read lock holders), as for another task to
> +have downgraded they would have had to have exclusive access to the semaphore
> +which can't be the case until the current task completes what it is doing.
> +
> +Stack expansion
> +---------------
> +
> +Stack expansion throws up additional complexities in that we cannot permit there
> +to be racing page faults, as a result we invoke `vma_start_write()` to prevent
> +this in `expand_downwards()` or `expand_upwards()`.
> +
> +Lock ordering
> +-------------
> +
> +As we have multiple locks across the kernel which may or may not be taken at the
> +same time as explicit mm or VMA locks, we have to be wary of lock inversion, and
> +the **order** in which locks are acquired and released becomes very important.
> +
> +.. note::
> +
> +   Lock inversion occurs when two threads need to acquire multiple locks,
> +   but in doing so inadvertently cause a mutual deadlock.
> +
> +   For example, consider thread 1 which holds lock A and tries to acquire lock B,
> +   while thread 2 holds lock B and tries to acquire lock A.
> +
> +   Both threads are now deadlocked on each other. However, had they attempted to
> +   acquire locks in the same order, one would have waited for the other to
> +   complete its work and no deadlock would have occurred.
> +
> +The opening comment in `mm/rmap.c` describes in detail the required ordering of
> +locks within memory management code:
> +
> +.. code-block::
> +
> +  inode->i_rwsem	(while writing or truncating, not reading or faulting)
> +    mm->mmap_lock
> +      mapping->invalidate_lock (in filemap_fault)
> +        folio_lock
> +          hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
> +            vma_start_write
> +              mapping->i_mmap_rwsem
> +                anon_vma->rwsem
> +                  mm->page_table_lock or pte_lock
> +                    swap_lock (in swap_duplicate, swap_info_get)
> +                      mmlist_lock (in mmput, drain_mmlist and others)
> +                      mapping->private_lock (in block_dirty_folio)
> +                          i_pages lock (widely used)
> +                            lruvec->lru_lock (in folio_lruvec_lock_irq)
> +                      inode->i_lock (in set_page_dirty's __mark_inode_dirty)
> +                      bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
> +                        sb_lock (within inode_lock in fs/fs-writeback.c)
> +                        i_pages lock (widely used, in set_page_dirty,
> +                                  in arch-dependent flush_dcache_mmap_lock,
> +                                  within bdi.wb->list_lock in __sync_single_inode)
> +
> +Please check the current state of this comment which may have changed since the
> +time of writing of this document.
> +
> +VMA lock internals
> +------------------
> +
> +The VMA lock mechanism is designed to be a lightweight means of avoiding the use
> +of the heavily contended mmap lock. It is implemented using a combination of a
> +read/write semaphore and sequence numbers belonging to the containing `struct
> +mm_struct` and the VMA.
> +
> +Read locks are acquired via `vma_start_read()`, which is an optimistic
> +operation, i.e. it tries to acquire a read lock but returns false if it is
> +unable to do so. At the end of the read operation, `vma_end_read()` is called to
> +release the VMA read lock. This can be done under RCU alone.
> +
> +Writing requires the mmap to be write-locked and the VMA lock to be acquired via
> +`vma_start_write()`, however the write lock is released by the termination or
> +downgrade of the mmap write lock so no `vma_end_write()` is required.
> +
> +All this is achieved by the use of per-mm and per-VMA sequence counts. This is
> +used to reduce complexity, and potential especially around operations which
> +write-lock multiple VMAs at once.
> +
> +If the mm sequence count, `mm->mm_lock_seq` is equal to the VMA sequence count
> +`vma->vm_lock_seq` then the VMA is write-locked. If they differ, then they are
> +not.
> +
> +Each time an mmap write lock is acquired in `mmap_write_lock()`,
> +`mmap_write_lock_nested()`, `mmap_write_lock_killable()`, the `mm->mm_lock_seq`
> +sequence number is incremented via `mm_lock_seqcount_begin()`.
> +
> +Each time the mmap write lock is released in `mmap_write_unlock()` or
> +`mmap_write_downgrade()`, `vma_end_write_all()` is invoked which also increments
> +`mm->mm_lock_seq` via `mm_lock_seqcount_end()`.
> +
> +This way, we ensure regardless of the VMA's sequence number count, that a write
> +lock is not incorrectly indicated (since we increment the sequence counter on
> +acquiring the mmap write lock, which is required in order to obtain a VMA write
> +lock), and that when we release an mmap write lock, we efficiently release
> +**all** VMA write locks contained within the mmap at the same time.
> +
> +The exclusivity of the mmap write lock ensures this is what we want, as there
> +would never be a reason to persist per-VMA write locks across multiple mmap
> +write lock acquisitions.
> +
> +Each time a VMA read lock is acquired, we acquire a read lock on the
> +`vma->vm_lock` read/write semaphore and hold it, while checking that the
> +sequence count of the VMA does not match that of the mm.
> +
> +If it does, the read lock fails. If it does not, we hold the lock, excluding
> +writers, but permitting other readers, who will also obtain this lock under RCU.
> +
> +Importantly, maple tree operations performed in `lock_vma_under_rcu()` are also
> +RCU safe, so the whole read lock operation is guaranteed to function correctly.
> +
> +On the write side, we acquire a write lock on the `vma->vm_lock` read/write
> +semaphore, before setting the VMA's sequence number under this lock, also
> +simultaneously holding the mmap write lock.
> +
> +This way, if any read locks are in effect, `vma_start_write()` will sleep until
> +these are finished and mutual exclusion is achieved.
> +
> +After setting the VMA's sequence number, the lock is released, avoiding
> +complexity with a long-term held write lock.
> +
> +This clever combination of a read/write semaphore and sequence count allows for
> +fast RCU-based per-VMA lock acquisition (especially on x86-64 page fault, though
> +utilised elsewhere) with minimal complexity around lock ordering.
> --
> 2.47.0
Lorenzo Stoakes Nov. 4, 2024, 1:02 p.m. UTC | #6
On Fri, Nov 01, 2024 at 04:48:32PM -0700, SeongJae Park wrote:
> On Fri, 1 Nov 2024 20:58:39 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> [...]
> > On Fri, Nov 01, 2024 at 06:50:33PM +0000, Lorenzo Stoakes wrote:
> > > Locking around VMAs is complicated and confusing. While we have a number of
> > > disparate comments scattered around the place, we seem to be reaching a
> > > level of complexity that justifies a serious effort at clearly documenting
> > > how locks are expected to be interacted with when it comes to interacting
> > > with mm_struct and vm_area_struct objects.
> > >
> > > This is especially pertinent as regards efforts to find sensible
> > > abstractions for these fundamental objects within the kernel rust
> > > abstraction whose compiler strictly requires some means of expressing these
> > > rules (and through this expression can help self-document these
> > > requirements as well as enforce them which is an exciting concept).
> > >
> > > The document limits scope to mmap and VMA locks and those that are
> > > immediately adjacent and relevant to them - so additionally covers page
> > > table locking as this is so very closely tied to VMA operations (and relies
> > > upon us handling these correctly).
> > >
> > > The document tries to cover some of the nastier and more confusing edge
> > > cases and concerns especially around lock ordering and page table teardown.
> > >
> > > The document also provides some VMA lock internals, which are up to date
> > > and inclusive of recent changes to recent sequence number changes.
> > >
> > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Acked-by: SeongJae Park <sj@kernel.org>

Thanks :)

>
> > > ---
> > >
> > > REVIEWERS NOTES:
> > >    You can speed up doc builds by running `make SPHINXDIRS=mm htmldocs`. I
> > >    also uploaded a copy of this to my website at
> > >    https://ljs.io/output/mm/vma_locks to make it easier to have a quick
> > >    read through. Thanks!
> > >
> > >
> > >  Documentation/mm/index.rst     |   1 +
> > >  Documentation/mm/vma_locks.rst | 527 +++++++++++++++++++++++++++++++++
> > >  2 files changed, 528 insertions(+)
> > >  create mode 100644 Documentation/mm/vma_locks.rst
> > >
> > > diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
> > > index 0be1c7503a01..da5f30acaca5 100644
> > > --- a/Documentation/mm/index.rst
> > > +++ b/Documentation/mm/index.rst
> > > @@ -64,3 +64,4 @@ documentation, or deleted if it has served its purpose.
> > >     vmemmap_dedup
> > >     z3fold
> > >     zsmalloc
> > > +   vma_locks
>
> This is the "Unsorted Documentation" section.  If the document is really for
> the section, I'd suggest putting it in alphabetically sorted order, for the
> consistency.  However, if putting the document under the section is not your
> real intention, I think it might be better to be put under "Process Addresses"
> section above.  What do you think?

Well, at the moment it's sort of a WIP thing that we may want to put under
another section, was just putting there somewhat arbitrarily for now.

I also wanted to avoid too much debate about what to put where :P

But absolutely, ack, will either sort it there or put it somewhere more
sensible, thanks!

>
>
> Thanks,
> SJ
>
> [...]
Mike Rapoport Nov. 4, 2024, 1:47 p.m. UTC | #7
On Mon, Nov 04, 2024 at 01:02:19PM +0000, Lorenzo Stoakes wrote:
> On Fri, Nov 01, 2024 at 04:48:32PM -0700, SeongJae Park wrote:
> >
> > This is the "Unsorted Documentation" section.  If the document is really for
> > the section, I'd suggest putting it in alphabetically sorted order, for the
> > consistency.  However, if putting the document under the section is not your
> > real intention, I think it might be better to be put under "Process Addresses"
> > section above.  What do you think?
> 
> Well, at the moment it's sort of a WIP thing that we may want to put under
> another section, was just putting there somewhat arbitrarily for now.
> 
> I also wanted to avoid too much debate about what to put where :P
> 
> But absolutely, ack, will either sort it there or put it somewhere more
> sensible, thanks!

Don't mean to bikeshed, but it would make sense to put it in the "Process
Address (space)" part :)
Lorenzo Stoakes Nov. 4, 2024, 2:17 p.m. UTC | #8
On Sat, Nov 02, 2024 at 11:00:20AM +0200, Mike Rapoport wrote:
> On Fri, Nov 01, 2024 at 06:50:33PM +0000, Lorenzo Stoakes wrote:
> > Locking around VMAs is complicated and confusing. While we have a number of
> > disparate comments scattered around the place, we seem to be reaching a
> > level of complexity that justifies a serious effort at clearly documenting
> > how locks are expected to be interacted with when it comes to interacting
> > with mm_struct and vm_area_struct objects.
> >
> > This is especially pertinent as regards efforts to find sensible
> > abstractions for these fundamental objects within the kernel rust
> > abstraction whose compiler strictly requires some means of expressing these
> > rules (and through this expression can help self-document these
> > requirements as well as enforce them which is an exciting concept).
> >
> > The document limits scope to mmap and VMA locks and those that are
> > immediately adjacent and relevant to them - so additionally covers page
> > table locking as this is so very closely tied to VMA operations (and relies
> > upon us handling these correctly).
> >
> > The document tries to cover some of the nastier and more confusing edge
> > cases and concerns especially around lock ordering and page table teardown.
> >
> > The document also provides some VMA lock internals, which are up to date
> > and inclusive of recent changes to recent sequence number changes.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> >
> > REVIEWERS NOTES:
> >    You can speed up doc builds by running `make SPHINXDIRS=mm htmldocs`. I
> >    also uploaded a copy of this to my website at
> >    https://ljs.io/output/mm/vma_locks to make it easier to have a quick
> >    read through. Thanks!
> >
> >
> >  Documentation/mm/index.rst     |   1 +
> >  Documentation/mm/vma_locks.rst | 527 +++++++++++++++++++++++++++++++++
> >  2 files changed, 528 insertions(+)
> >  create mode 100644 Documentation/mm/vma_locks.rst
> >
> > diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
> > index 0be1c7503a01..da5f30acaca5 100644
> > --- a/Documentation/mm/index.rst
> > +++ b/Documentation/mm/index.rst
> > @@ -64,3 +64,4 @@ documentation, or deleted if it has served its purpose.
> >     vmemmap_dedup
> >     z3fold
> >     zsmalloc
> > +   vma_locks
>
> Please keep the TOC sorted alphabetically.

Ack, as per response to SJ, will address.

>
> > diff --git a/Documentation/mm/vma_locks.rst b/Documentation/mm/vma_locks.rst
> > new file mode 100644
> > index 000000000000..52b9d484376a
> > --- /dev/null
> > +++ b/Documentation/mm/vma_locks.rst
> > @@ -0,0 +1,527 @@
> > +VMA Locking
> > +===========
> > +
> > +Overview
> > +--------
> > +
> > +Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
> > +'VMA's of type `struct vm_area_struct`.
> > +
> > +Each VMA describes a virtually contiguous memory range with identical
> > +attributes, each of which described by a `struct vm_area_struct`
> > +object. Userland access outside of VMAs is invalid except in the case where an
> > +adjacent stack VMA could be extended to contain the accessed address.
> > +
> > +All VMAs are contained within one and only one virtual address space, described
> > +by a `struct mm_struct` object which is referenced by all tasks (that is,
> > +threads) which share the virtual address space. We refer to this as the `mm`.
> > +
> > +Each mm object contains a maple tree data structure which describes all VMAs
> > +within the virtual address space.
> > +
> > +The kernel is designed to be highly scalable against concurrent access to
> > +userland memory,
>
> "and concurrent changes to the virtual address space layoyt"?

Well, not sure that's quite true to be honest, because we go out of our way to
exclude other users when we change the address space layout. Really I was
getting at the fact you can have simultaneous readers and especially, with VMA
locks, simultaneous fault handlers.

Will update to reference read operations against VMAs.
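
Perhaps with a quick sketch of how the read side gets used on the fault
path, something like the below (simplified and written from memory rather
than lifted from the actual fault handler, so treat the details as
approximate):

        struct vm_area_struct *vma;

        vma = lock_vma_under_rcu(mm, address);
        if (!vma) {
                /* Contended, or a competing writer - fall back. */
                mmap_read_lock(mm);
                vma = vma_lookup(mm, address);
                /* ... handle the fault under the mmap read lock ... */
                mmap_read_unlock(mm);
                return;
        }

        /*
         * The VMA is stable here and can be read concurrently with
         * other VMA read lock holders, but nobody can write to it.
         */
        /* ... handle the fault ... */
        vma_end_read(vma);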

>
> > so a complicated set of locks are required to ensure no data
> > +races or memory corruption occurs.
> > +
> > +This document explores this locking in detail.
> > +
> > +.. note::
> > +
> > +   There are three different things that a user might want to achieve via
> > +   locks - the first of which is **stability**. That is - ensuring that the VMA
> > +   won't be freed or modified in any way from underneath us.
> > +
> > +   All MM and VMA locks ensure stability.
> > +
> > +   Secondly we have locks which allow **reads** but not writes (and which might
> > +   be held concurrent with other CPUs who also hold the read lock).
>
> I think it should be clarified here that *reads* are from data structures
> rather than user memory.

Ack will update.

>
> > +
> > +   Finally, we have locks which permit exclusive access to the VMA to allow for
>
>                                                                       ^ object
> > +   **writes** to the VMA.

Ack

> > +
> > +MM and VMA locks
> > +----------------
> > +
> > +There are two key classes of lock utilised when reading and manipulating VMAs -
> > +the `mmap_lock` which is a read/write semaphore maintained at the `mm_struct`
> > +level of granularity and, if CONFIG_PER_VMA_LOCK is set, a per-VMA lock at the
> > +VMA level of granularity.
> > +
> > +.. note::
> > +
> > +   Generally speaking, a read/write semaphore is a class of lock which permits
> > +   concurrent readers. However a write lock can only be obtained once all
> > +   readers have left the critical region (and pending readers made to wait).
> > +
> > +   This renders read locks on a read/write semaphore concurrent with other
> > +   readers and write locks exclusive against all others holding the semaphore.
> > +
> > +If CONFIG_PER_VMA_LOCK is not set, then things are relatively simple - a write
> > +mmap lock gives you exclusive write access to a VMA, and a read lock gives you
> > +concurrent read-only access.
> > +
> > +In the presence of CONFIG_PER_VMA_LOCK, i.e. VMA locks, things are more
> > +complicated. In this instance, a write semaphore is no longer enough to gain
> > +exclusive access to a VMA, a VMA write lock is also required.
> > +
> > +The VMA lock is implemented via the use of both a read/write semaphore and
> > +per-VMA and per-mm sequence numbers. We go into detail on this in the VMA lock
> > +internals section below, so for the time being it is important only to note that
> > +we can obtain either a VMA read or write lock.
> > +
> > +.. note::
>
> I don't think the below text should be a "note", I'd just keep at
> continuation of the section

Ack will change.

>
> > +
> > +   VMAs under VMA **read** lock are obtained by the `lock_vma_under_rcu()`
> > +   function, and **no** existing mmap or VMA lock must be held, This function
> > +   either returns a read-locked VMA, or NULL if the lock could not be
> > +   acquired. As the name suggests, the VMA will be acquired under RCU, though
> > +   once obtained, remains stable.
> > +
> > +   This kind of locking is entirely optimistic - if the lock is contended or a
> > +   competing write has started, then we do not obtain a read lock.
> > +
> > +   The `lock_vma_under_rcu()` function first calls `rcu_read_lock()` to ensure
> > +   that the VMA is acquired in an RCU critical section, then attempts to VMA
> > +   lock it via `vma_start_read()`, before releasing the RCU lock via
> > +   `rcu_read_unlock()`.
> > +
> > +   VMA read locks hold the a read lock on the `vma->vm_lock` semaphore for their
>
>                          ^ no idea if it should be 'a' or 'the', but surely
> not both :)

Sometimes you have to put both _just to be sure_ ;) nah only joking, will drop
the 'a'...

>
> > +   duration and the caller of `lock_vma_under_rcu()` must release it via
> > +   `vma_end_read()`.
> > +
> > +   VMA **write** locks are acquired via `vma_start_write()` in instances where a
> > +   VMA is about to be modified, unlike `vma_start_read()` the lock is always
> > +   acquired. An mmap write lock **must** be held for the duration of the VMA
> > +   write lock, releasing or downgrading the mmap write lock also releases the
> > +   VMA write lock so there is no `vma_end_write()` function.
> > +
> > +   Note that a semaphore write lock is not held across a VMA lock. Rather, a
> > +   sequence number is used for serialisation, and the write semaphore is only
> > +   acquired at the point of write lock to update this (we explore this in detail
> > +   in the VMA lock internals section below).
> > +
> > +   This ensures the semantics we require - VMA write locks provide exclusive
> > +   write access to the VMA.
> > +
> > +Examining all valid lock state and what each implies:
> > +
> > +.. list-table::
> > +   :header-rows: 1
>
> Can we make it just .. table:

I didn't know you could do tables. I am new to rst... :) Will do!

>
> .. table::
>
>     ========= ======== ======= ================ =================
>     mmap lock VMA lock Stable? Can read safely? Can write safely?
>     ========= ======== ======= ================ =================
>     \-        \-       N       N                N
>     R         \-       Y       Y                N
>     \-        R        Y       Y                N
>     W         \-       Y       Y                N
>     W         W        Y       Y                Y
>     ========= ======== ======= ================ =================
>
> > +
> > +   * - mmap lock
> > +     - VMA lock
> > +     - Stable?
> > +     - Can read safely?
> > +     - Can write safely?
> > +   * - \-
> > +     - \-
> > +     - N
> > +     - N
> > +     - N
> > +   * - R
> > +     - \-
> > +     - Y
> > +     - Y
> > +     - N
> > +   * - \-
> > +     - R
> > +     - Y
> > +     - Y
> > +     - N
> > +   * - W
> > +     - \-
> > +     - Y
> > +     - Y
> > +     - N
> > +   * - W
> > +     - W
> > +     - Y
> > +     - Y
> > +     - Y
> > +
> > +Note that there are some exceptions to this - the `anon_vma` field is permitted
> > +to be written to under mmap read lock and is instead serialised by the `struct
> > +mm_struct` field `page_table_lock`. In addition the `vm_mm` and all
> > +lock-specific fields are permitted to be read under RCU alone  (though stability cannot
> > +be expected in this instance).
> > +
> > +.. note::
> > +   The most notable place to use the VMA read lock is on page table faults on
> > +   the x86-64 architecture, which importantly means that without a VMA write
> > +   lock, page faults can race against you even if you hold an mmap write lock.
> > +
> > +VMA Fields
> > +----------
> > +
> > +We examine each field of the `struct vm_area_struct` type in detail in the table
> > +below.
> > +
> > +Reading of each field requires either an mmap read lock or a VMA read lock to be
> > +held, except where 'unstable RCU read' is specified, in which case unstable
> > +access to the field is permitted under RCU alone.
> > +
> > +The table specifies which write locks must be held to write to the field.
> > +
> > +.. list-table::
> > +   :widths: 20 10 22 5 20
> > +   :header-rows: 1
>
> And use .. table here as well, e.g

Hm this one is a little less clearly worth it because not only will that take me
ages but it'll be quite difficult to read in a sensible editor. I can if you
insist though?

>
> .. table::
>
>     ======== ======== ========================== ================== ==========
>     Field    Config   Description                Unstable RCU read? Write lock
>     ======== ======== ========================== ================== ==========
>     vm_start          Inclusive start virtual                       mmap write,
>                       address of range VMA                          VMA write
> 		      describes
>
>     vm_end            Exclusive end virtual                         mmap write,
>                       address of range VMA                          VMA write
> 		      describes
>
>     vm_rcu   vma_lock RCU list head, in union    N/A                N/A
>                       with vma_start, vma_end.
> 		      RCU implementation detail
>     ======== ======== ========================== ================== ==========
>
>
> > +
> > +   * - Field
> > +     - Config
> > +     - Description
> > +     - Unstable RCU read?
> > +     - Write Lock
> > +   * - vm_start
> > +     -
> > +     - Inclusive start virtual address of range VMA describes.
> > +     -
> > +     - mmap write, VMA write
> > +   * - vm_end
> > +     -
> > +     - Exclusive end virtual address of range VMA describes.
> > +     -
> > +     - mmap write, VMA write
> > +   * - vm_rcu
> > +     - vma lock
> > +     - RCU list head, in union with vma_start, vma_end. RCU implementation detail.
> > +     - N/A
> > +     - N/A
> > +   * - vm_mm
> > +     -
> > +     - Containing mm_struct.
> > +     - Y
> > +     - (Static)
> > +   * - vm_page_prot
> > +     -
> > +     - Architecture-specific page table protection bits determined from VMA
> > +       flags
> > +     -
> > +     - mmap write, VMA write
> > +   * - vm_flags
> > +     -
> > +     - Read-only access to VMA flags describing attributes of VMA, in union with
> > +       private writable `__vm_flags`.
> > +     -
> > +     - N/A
> > +   * - __vm_flags
> > +     -
> > +     - Private, writable access to VMA flags field, updated by vm_flags_*()
> > +       functions.
> > +     -
> > +     - mmap write, VMA write
> > +   * - detached
> > +     - vma lock
> > +     - VMA lock implementation detail - indicates whether the VMA has been
> > +       detached from the tree.
> > +     - Y
> > +     - mmap write, VMA write
> > +   * - vm_lock_seq
> > +     - vma lock
> > +     - VMA lock implementation detail - A sequence number used to serialise the
> > +       VMA lock, see the VMA lock section below.
> > +     - Y
> > +     - mmap write, VMA write
> > +   * - vm_lock
> > +     - vma lock
> > +     - VMA lock implementation detail - A pointer to the VMA lock read/write
> > +       semaphore.
> > +     - Y
> > +     - None required
> > +   * - shared.rb
> > +     -
> > +     - A red/black tree node used, if the mapping is file-backed, to place the
> > +       VMA in the `struct address_space->i_mmap` red/black interval tree.
> > +     -
> > +     - mmap write, VMA write, i_mmap write
> > +   * - shared.rb_subtree_last
> > +     -
> > +     - Metadata used for management of the interval tree if the VMA is
> > +       file-backed.
> > +     -
> > +     - mmap write, VMA write, i_mmap write
> > +   * - anon_vma_chain
> > +     -
> > +     - List of links to forked/CoW'd `anon_vma` objects.
> > +     -
> > +     - mmap read or above, anon_vma write lock
> > +   * - anon_vma
> > +     -
> > +     - `anon_vma` object used by anonymous folios mapped exclusively to this VMA.
> > +     -
> > +     - mmap read or above, page_table_lock
> > +   * - vm_ops
> > +     -
> > +     - If the VMA is file-backed, then either the driver or file-system provides
> > +       a `struct vm_operations_struct` object describing callbacks to be invoked
> > +       on specific VMA lifetime events.
> > +     -
> > +     - (Static)
> > +   * - vm_pgoff
> > +     -
> > +     - Describes the page offset into the file, the original page offset within
> > +       the virtual address space (prior to any `mremap()`), or PFN if a PFN map.
> > +     -
> > +     - mmap write, VMA write
> > +   * - vm_file
> > +     -
> > +     - If the VMA is file-backed, points to a `struct file` object describing
> > +       the underlying file, if anonymous then `NULL`.
> > +     -
> > +     - (Static)
> > +   * - vm_private_data
> > +     -
> > +     - A `void *` field for driver-specific metadata.
> > +     -
> > +     - Driver-mandated.
> > +   * - anon_name
> > +     - anon name
> > +     - A field for storing a `struct anon_vma_name` object providing a name for
> > +       anonymous mappings, or `NULL` if none is set or the VMA is file-backed.
> > +     -
> > +     - mmap write, VMA write
> > +   * - swap_readahead_info
> > +     - swap
> > +     - Metadata used by the swap mechanism to perform readahead.
> > +     -
> > +     - mmap read
> > +   * - vm_region
> > +     - nommu
> > +     - The containing region for the VMA for architectures which do not
> > +       possess an MMU.
> > +     - N/A
> > +     - N/A
> > +   * - vm_policy
> > +     - numa
> > +     - `mempolicy` object which describes NUMA behaviour of the VMA.
> > +     -
> > +     - mmap write, VMA write
> > +   * - numab_state
> > +     - numab
> > +     - `vma_numab_state` object which describes the current state of NUMA
> > +       balancing in relation to this VMA.
> > +     -
> > +     - mmap write, VMA write
> > +   * - vm_userfaultfd_ctx
> > +     -
> > +     - Userfaultfd context wrapper object of type `vm_userfaultfd_ctx`, either
> > +       of zero size if userfaultfd is disabled, or containing a pointer to an
> > +       underlying `userfaultfd_ctx` object which describes userfaultfd metadata.
> > +     -
> > +     - mmap write, VMA write
> > +
> > +.. note::
> > +
> > +   In the config column 'vma lock' configuration means CONFIG_PER_VMA_LOCK,
> > +   'anon name' means CONFIG_ANON_VMA_NAME, 'swap' means CONFIG_SWAP, 'nommu'
> > +   means that CONFIG_MMU is not set, 'numa' means CONFIG_NUMA and 'numab' means
> > +   CONFIG_NUMA_BALANCING'.
> > +
> > +   In the write lock column '(Static)' means that the field is set only once
> > +   upon initialisation of the VMA and not changed after this, the VMA would
> > +   either have been under an mmap write and VMA write lock at the time or not
> > +   yet inserted into any tree.
> > +
> > +Page table locks
> > +----------------
> > +
> > +When allocating a P4D, PUD or PMD and setting the relevant entry in the above
> > +PGD, P4D or PUD, the `mm->page_table_lock` is acquired to do so. This is
> > +acquired in `__p4d_alloc()`, `__pud_alloc()` and `__pmd_alloc()` respectively.
> > +
> > +.. note::
> > +   `__pmd_alloc()` actually invokes `pud_lock()` and `pud_lockptr()` in turn,
> > +   however at the time of writing it ultimately references the
> > +   `mm->page_table_lock`.
> > +
> > +Allocating a PTE will either use the `mm->page_table_lock` or, if
> > +`USE_SPLIT_PMD_PTLOCKS` is defined, used a lock embedded in the PMD physical
> > +page metadata in the form of a `struct ptdesc`, acquired by `pmd_ptdesc()`
> > +called from `pmd_lock()` and ultimately `__pte_alloc()`.
> > +
> > +Finally, modifying the contents of the PTE has special treatment, as this is a
> > +lock that we must acquire whenever we want stable and exclusive access to
> > +entries pointing to data pages within a PTE, especially when we wish to modify
> > +them.
> > +
> > +This is performed via `pte_offset_map_lock()` which carefully checks to ensure
> > +that the PTE hasn't changed from under us, ultimately invoking `pte_lockptr()`
> > +to obtain a spin lock at PTE granularity contained within the `struct ptdesc`
> > +associated with the physical PTE page. The lock must be released via
> > +`pte_unmap_unlock()`.
> > +
> > +.. note::
> > +   There are some variants on this, such as `pte_offset_map_rw_nolock()` when we
> > +   know we hold the PTE stable but for brevity we do not explore this.
> > +   See the comment for `__pte_offset_map_lock()` for more details.
> > +
> > +When modifying data in ranges we typically only wish to allocate higher page
> > +tables as necessary, using these locks to avoid races or overwriting anything,
> > +and set/clear data at the PTE level as required (for instance when page faulting
> > +or zapping).
> > +
> > +Page table teardown
> > +-------------------
> > +
> > +Tearing down page tables themselves is something that requires significant
> > +care. There must be no way that page tables designated for removal can be
> > +traversed or referenced by concurrent tasks.
> > +
> > +It is insufficient to simply hold an mmap write lock and VMA lock (which will
> > +prevent racing faults, and rmap operations), as a file-backed mapping can be
> > +truncated under the `struct address_space` i_mmap_lock alone.
> > +
> > +As a result, no VMA which can be accessed via the reverse mapping (either
> > +anon_vma or the `struct address_space->i_mmap` interval tree) can have its page
> > +tables torn down.
> > +
> > +The operation is typically performed via `free_pgtables()`, which assumes either
> > +the mmap write lock has been taken (as specified by its `mm_wr_locked`
> > +parameter), or that it the VMA is fully detached.
>
>                "or that the VMA is..." ?
>
> > +It carefully removes the VMA from all reverse mappings, however it's important
> > +that no new ones overlap these or any route remain to permit access to addresses
> > +within the range whose page tables are being torn down.
> > +
> > +As a result of these careful conditions, note that page table entries are
> > +cleared without page table locks, as it is assumed that all of these precautions
> > +have already been taken.
> > +
> > +mmap write lock downgrading
> > +---------------------------
> > +
> > +While it is possible to obtain an mmap write or read lock using the
> > +`mm->mmap_lock` read/write semaphore, it is also possible to **downgrade** from
> > +a write lock to a read lock via `mmap_write_downgrade()`.
> > +
> > +Similar to `mmap_write_unlock()`, this implicitly terminates all VMA write locks
> > +via `vma_end_write_all()` (more or this behaviour in the VMA lock internals
> > +section below), but importantly does not relinquish the mmap lock while
> > +downgrading, therefore keeping the locked virtual address space stable.
> > +
> > +A subtlety here is that callers can assume, if they invoke an
> > +mmap_write_downgrade() operation, that they still have exclusive access to the
> > +virtual address space (excluding VMA read lock holders), as for another task to
> > +have downgraded they would have had to have exclusive access to the semaphore
> > +which can't be the case until the current task completes what it is doing.
> > +
> > +Stack expansion
> > +---------------
> > +
> > +Stack expansion throws up additional complexities in that we cannot permit there
> > +to be racing page faults, as a result we invoke `vma_start_write()` to prevent
> > +this in `expand_downwards()` or `expand_upwards()`.
> > +
> > +Lock ordering
> > +-------------
> > +
> > +As we have multiple locks across the kernel which may or may not be taken at the
> > +same time as explicit mm or VMA locks, we have to be wary of lock inversion, and
> > +the **order** in which locks are acquired and released becomes very important.
> > +
> > +.. note::
> > +
> > +   Lock inversion occurs when two threads need to acquire multiple locks,
> > +   but in doing so inadvertently cause a mutual deadlock.
> > +
> > +   For example, consider thread 1 which holds lock A and tries to acquire lock B,
> > +   while thread 2 holds lock B and tries to acquire lock A.
> > +
> > +   Both threads are now deadlocked on each other. However, had they attempted to
> > +   acquire locks in the same order, one would have waited for the other to
> > +   complete its work and no deadlock would have occurred.
> > +
> > +The opening comment in `mm/rmap.c` describes in detail the required ordering of
> > +locks within memory management code:
> > +
> > +.. code-block::
> > +
> > +  inode->i_rwsem	(while writing or truncating, not reading or faulting)
> > +    mm->mmap_lock
> > +      mapping->invalidate_lock (in filemap_fault)
> > +        folio_lock
> > +          hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
> > +            vma_start_write
> > +              mapping->i_mmap_rwsem
> > +                anon_vma->rwsem
> > +                  mm->page_table_lock or pte_lock
> > +                    swap_lock (in swap_duplicate, swap_info_get)
> > +                      mmlist_lock (in mmput, drain_mmlist and others)
> > +                      mapping->private_lock (in block_dirty_folio)
> > +                          i_pages lock (widely used)
> > +                            lruvec->lru_lock (in folio_lruvec_lock_irq)
> > +                      inode->i_lock (in set_page_dirty's __mark_inode_dirty)
> > +                      bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
> > +                        sb_lock (within inode_lock in fs/fs-writeback.c)
> > +                        i_pages lock (widely used, in set_page_dirty,
> > +                                  in arch-dependent flush_dcache_mmap_lock,
> > +                                  within bdi.wb->list_lock in __sync_single_inode)
> > +
> > +Please check the current state of this comment which may have changed since the
> > +time of writing of this document.
> > +
> > +VMA lock internals
> > +------------------
> > +
> > +The VMA lock mechanism is designed to be a lightweight means of avoiding the use
> > +of the heavily contended mmap lock. It is implemented using a combination of a
> > +read/write semaphore and sequence numbers belonging to the containing `struct
> > +mm_struct` and the VMA.
> > +
> > +Read locks are acquired via `vma_start_read()`, which is an optimistic
> > +operation, i.e. it tries to acquire a read lock but returns false if it is
> > +unable to do so. At the end of the read operation, `vma_end_read()` is called to
> > +release the VMA read lock. This can be done under RCU alone.
> > +
> > +Writing requires the mmap to be write-locked and the VMA lock to be acquired via
> > +`vma_start_write()`, however the write lock is released by the termination or
> > +downgrade of the mmap write lock so no `vma_end_write()` is required.
> > +
> > +All this is achieved by the use of per-mm and per-VMA sequence counts. This is
> > +used to reduce complexity, and potential especially around operations which
> > +write-lock multiple VMAs at once.
> > +
> > +If the mm sequence count, `mm->mm_lock_seq` is equal to the VMA sequence count
> > +`vma->vm_lock_seq` then the VMA is write-locked. If they differ, then they are
> > +not.
> > +
> > +Each time an mmap write lock is acquired in `mmap_write_lock()`,
> > +`mmap_write_lock_nested()`, `mmap_write_lock_killable()`, the `mm->mm_lock_seq`
> > +sequence number is incremented via `mm_lock_seqcount_begin()`.
> > +
> > +Each time the mmap write lock is released in `mmap_write_unlock()` or
> > +`mmap_write_downgrade()`, `vma_end_write_all()` is invoked which also increments
> > +`mm->mm_lock_seq` via `mm_lock_seqcount_end()`.
> > +
> > +This way, we ensure regardless of the VMA's sequence number count, that a write
> > +lock is not incorrectly indicated (since we increment the sequence counter on
> > +acquiring the mmap write lock, which is required in order to obtain a VMA write
> > +lock), and that when we release an mmap write lock, we efficiently release
> > +**all** VMA write locks contained within the mmap at the same time.
> > +
> > +The exclusivity of the mmap write lock ensures this is what we want, as there
> > +would never be a reason to persist per-VMA write locks across multiple mmap
> > +write lock acquisitions.
> > +
> > +Each time a VMA read lock is acquired, we acquire a read lock on the
> > +`vma->vm_lock` read/write semaphore and hold it, while checking that the
> > +sequence count of the VMA does not match that of the mm.
> > +
> > +If it does, the read lock fails. If it does not, we hold the lock, excluding
> > +writers, but permitting other readers, who will also obtain this lock under RCU.
> > +
> > +Importantly, maple tree operations performed in `lock_vma_under_rcu()` are also
> > +RCU safe, so the whole read lock operation is guaranteed to function correctly.
> > +
> > +On the write side, we acquire a write lock on the `vma->vm_lock` read/write
> > +semaphore, before setting the VMA's sequence number under this lock, also
> > +simultaneously holding the mmap write lock.
> > +
> > +This way, if any read locks are in effect, `vma_start_write()` will sleep until
> > +these are finished and mutual exclusion is achieved.
> > +
> > +After setting the VMA's sequence number, the lock is released, avoiding
> > +complexity with a long-term held write lock.
> > +
> > +This clever combination of a read/write semaphore and sequence count allows for
> > +fast RCU-based per-VMA lock acquisition (especially on x86-64 page fault, though
> > +utilised elsewhere) with minimal complexity around lock ordering.
> > --
> > 2.47.0
>
> --
> Sincerely yours,
> Mike.
Alice Ryhl Nov. 4, 2024, 2:47 p.m. UTC | #9
On Fri, Nov 1, 2024 at 7:50 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> Locking around VMAs is complicated and confusing. While we have a number of
> disparate comments scattered around the place, we seem to be reaching a
> level of complexity that justifies a serious effort at clearly documenting
> how locks are expected to be interacted with when it comes to interacting
> with mm_struct and vm_area_struct objects.
>
> This is especially pertinent as regards efforts to find sensible
> abstractions for these fundamental objects within the kernel rust
> abstraction whose compiler strictly requires some means of expressing these
> rules (and through this expression can help self-document these
> requirements as well as enforce them which is an exciting concept).
>
> The document limits scope to mmap and VMA locks and those that are
> immediately adjacent and relevant to them - so additionally covers page
> table locking as this is so very closely tied to VMA operations (and relies
> upon us handling these correctly).
>
> The document tries to cover some of the nastier and more confusing edge
> cases and concerns especially around lock ordering and page table teardown.
>
> The document also provides some VMA lock internals, which are up to date
> and inclusive of recent changes to recent sequence number changes.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

[...]

> +Page table locks
> +----------------
> +
> +When allocating a P4D, PUD or PMD and setting the relevant entry in the above
> +PGD, P4D or PUD, the `mm->page_table_lock` is acquired to do so. This is
> +acquired in `__p4d_alloc()`, `__pud_alloc()` and `__pmd_alloc()` respectively.
> +
> +.. note::
> +   `__pmd_alloc()` actually invokes `pud_lock()` and `pud_lockptr()` in turn,
> +   however at the time of writing it ultimately references the
> +   `mm->page_table_lock`.
> +
> +Allocating a PTE will either use the `mm->page_table_lock` or, if
> +`USE_SPLIT_PMD_PTLOCKS` is defined, used a lock embedded in the PMD physical
> +page metadata in the form of a `struct ptdesc`, acquired by `pmd_ptdesc()`
> +called from `pmd_lock()` and ultimately `__pte_alloc()`.
> +
> +Finally, modifying the contents of the PTE has special treatment, as this is a
> +lock that we must acquire whenever we want stable and exclusive access to
> +entries pointing to data pages within a PTE, especially when we wish to modify
> +them.
> +
> +This is performed via `pte_offset_map_lock()` which carefully checks to ensure
> +that the PTE hasn't changed from under us, ultimately invoking `pte_lockptr()`
> +to obtain a spin lock at PTE granularity contained within the `struct ptdesc`
> +associated with the physical PTE page. The lock must be released via
> +`pte_unmap_unlock()`.
> +
> +.. note::
> +   There are some variants on this, such as `pte_offset_map_rw_nolock()` when we
> +   know we hold the PTE stable but for brevity we do not explore this.
> +   See the comment for `__pte_offset_map_lock()` for more details.
> +
> +When modifying data in ranges we typically only wish to allocate higher page
> +tables as necessary, using these locks to avoid races or overwriting anything,
> +and set/clear data at the PTE level as required (for instance when page faulting
> +or zapping).

Speaking as someone who doesn't know the internals at all ... this
section doesn't really answer any questions I have about the page
table. It looks like this could use an initial section about basic
usage, and the detailed information could come after? Concretely, if I
wish to call vm_insert_page or zap some pages, what are the locking
requirements? What if I'm writing a page fault handler?

Alice
Mike Rapoport Nov. 4, 2024, 3:19 p.m. UTC | #10
On Mon, Nov 04, 2024 at 02:17:36PM +0000, Lorenzo Stoakes wrote:
> On Sat, Nov 02, 2024 at 11:00:20AM +0200, Mike Rapoport wrote:
> > > +
> > > +The table specifies which write locks must be held to write to the field.
> > > +
> > > +.. list-table::
> > > +   :widths: 20 10 22 5 20
> > > +   :header-rows: 1
> >
> > And use .. table here as well, e.g
> 
> Hm this one is a little less clearly worth it because not only will that take me
> ages but it'll be quite difficult to read in a sensible editor. I can if you
> insist though?

With spaces it will look just fine in a text editor and IMHO better than
list-table, but I don't insist.
 
> > .. table::
> >
> >     ======== ======== ========================== ================== ==========
> >     Field    Config   Description                Unstable RCU read? Write lock
> >     ======== ======== ========================== ================== ==========
> >     vm_start          Inclusive start virtual                       mmap write,
> >                       address of range VMA                          VMA write
> >                       describes
> >
> >     vm_end            Exclusive end virtual                         mmap write,
> >                       address of range VMA                          VMA write
> >                       describes
> >
> >     vm_rcu   vma_lock RCU list head, in union    N/A                N/A
> >                       with vma_start, vma_end.
> >                       RCU implementation detail
> >     ======== ======== ========================== ================== ==========
Lorenzo Stoakes Nov. 4, 2024, 4:42 p.m. UTC | #11
On Sat, Nov 02, 2024 at 02:45:35AM +0100, Jann Horn wrote:
> On Fri, Nov 1, 2024 at 7:50 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > Locking around VMAs is complicated and confusing. While we have a number of
> > disparate comments scattered around the place, we seem to be reaching a
> > level of complexity that justifies a serious effort at clearly documenting
> > how locks are expected to be interacted with when it comes to interacting
> > with mm_struct and vm_area_struct objects.
> >
> > This is especially pertinent as regards efforts to find sensible
> > abstractions for these fundamental objects within the kernel rust
> > abstraction whose compiler strictly requires some means of expressing these
> > rules (and through this expression can help self-document these
> > requirements as well as enforce them which is an exciting concept).
> >
> > The document limits scope to mmap and VMA locks and those that are
> > immediately adjacent and relevant to them - so additionally covers page
> > table locking as this is so very closely tied to VMA operations (and relies
> > upon us handling these correctly).
> >
> > The document tries to cover some of the nastier and more confusing edge
> > cases and concerns especially around lock ordering and page table teardown.
> >
> > The document also provides some VMA lock internals, which are up to date
> > and inclusive of recent changes to recent sequence number changes.
>
> Thanks for doing this!

No problem, at this point I think it's actually critical we have this.

>
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> >
> > REVIEWERS NOTES:
> >    You can speed up doc builds by running `make SPHINXDIRS=mm htmldocs`. I
> >    also uploaded a copy of this to my website at
> >    https://ljs.io/output/mm/vma_locks to make it easier to have a quick
> >    read through. Thanks!
> >
> >
> >  Documentation/mm/index.rst     |   1 +
> >  Documentation/mm/vma_locks.rst | 527 +++++++++++++++++++++++++++++++++
> >  2 files changed, 528 insertions(+)
> >  create mode 100644 Documentation/mm/vma_locks.rst
> >
> > diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
> > index 0be1c7503a01..da5f30acaca5 100644
> > --- a/Documentation/mm/index.rst
> > +++ b/Documentation/mm/index.rst
> > @@ -64,3 +64,4 @@ documentation, or deleted if it has served its purpose.
> >     vmemmap_dedup
> >     z3fold
> >     zsmalloc
> > +   vma_locks
> > diff --git a/Documentation/mm/vma_locks.rst b/Documentation/mm/vma_locks.rst
> > new file mode 100644
> > index 000000000000..52b9d484376a
> > --- /dev/null
> > +++ b/Documentation/mm/vma_locks.rst
> > @@ -0,0 +1,527 @@
> > +VMA Locking
> > +===========
> > +
> > +Overview
> > +--------
> > +
> > +Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
> > +'VMA's of type `struct vm_area_struct`.
> > +
> > +Each VMA describes a virtually contiguous memory range with identical
> > +attributes, each of which described by a `struct vm_area_struct`
> > +object. Userland access outside of VMAs is invalid except in the case where an
> > +adjacent stack VMA could be extended to contain the accessed address.
> > +
> > +All VMAs are contained within one and only one virtual address space, described
> > +by a `struct mm_struct` object which is referenced by all tasks (that is,
> > +threads) which share the virtual address space. We refer to this as the `mm`.
> > +
> > +Each mm object contains a maple tree data structure which describes all VMAs
> > +within the virtual address space.
>
> The gate VMA is special, on architectures that have it: Userland
> access to its area is allowed, but the area is outside the VA range
> managed by the normal MM code, and the gate VMA is a global object
> (not per-MM), and only a few places in MM code can interact with it
> (for example, page fault handling can't, but GUP can through
> get_gate_page()).
>
> (I think this also has the fun consequence that vm_normal_page() can
> get called on a VMA whose ->vm_mm is NULL, when called from
> get_gate_page().)

Yeah the gate page is weird, I'm not sure it's worth going into too much detail
here, but perhaps a note explaining in effect 'except for the gate page...'
would suffice - unless you think it'd be valuable to go into that in more
detail than a passing 'hey of course there's an exception to this!' comment? :)

>
> > +The kernel is designed to be highly scalable against concurrent access to
> > +userland memory, so a complicated set of locks are required to ensure no data
> > +races or memory corruption occurs.
> > +
> > +This document explores this locking in detail.
> > +
> > +.. note::
> > +
> > +   There are three different things that a user might want to achieve via
> > +   locks - the first of which is **stability**. That is - ensuring that the VMA
> > +   won't be freed or modified in any way from underneath us.
> > +
> > +   All MM and VMA locks ensure stability.
> > +
> > +   Secondly we have locks which allow **reads** but not writes (and which might
> > +   be held concurrent with other CPUs who also hold the read lock).
> > +
> > +   Finally, we have locks which permit exclusive access to the VMA to allow for
> > +   **writes** to the VMA.
>
> Maybe also mention that there are three major paths you can follow to
> reach a VMA? You can come through the mm's VMA tree, you can do an
> anon page rmap walk that goes page -> anon_vma -> vma, or you can do a
> file rmap walk from the address_space. Which is why just holding the
> mmap lock and vma lock in write mode is not enough to permit arbitrary
> changes to a VMA struct.

I totally agree that adding something about _where_ you can come from is a good
idea, will do.

However, in terms of the VMA itself, mmap lock and vma lock _are_ sufficient to
prevent arbitrary _changes_ to the VMA struct right?

It isn't sufficient to prevent _reading_ of vma metadata fields, nor walking of
underlying page tables, so if you're going to do something that changes
fundamentals you need to hide it from rmap.

Maybe worth going over relevant fields? Or rather adding an additional 'read
lock' column?

vma->vm_mm ('static' anyway after VMA created)
vma->vm_start (change on merge/split)
vma->vm_end (change on merge/split)
vma->vm_flags (can change)
vma->vm_ops ('static' anyway after call_mmap())

In any case this is absolutely _crucial_ I agree, will add.
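
For the 'changes' side I'm thinking of also adding a small sketch of the
standard write pattern, roughly along these lines (simplified, and assuming
the current helper names):

        mmap_write_lock(mm);
        vma = vma_lookup(mm, addr);
        if (vma) {
                /* Excludes VMA read lock holders, i.e. racing faults. */
                vma_start_write(vma);

                /*
                 * Now safe to modify fields guarded by the mmap + VMA
                 * write locks, e.g. vma->vm_page_prot, or the flags via
                 * vm_flags_set() (which takes the VMA write lock itself).
                 */
        }
        /* Releases the mmap lock and all VMA write locks taken above. */
        mmap_write_unlock(mm);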

>
> > +MM and VMA locks
> > +----------------
> > +
> > +There are two key classes of lock utilised when reading and manipulating VMAs -
> > +the `mmap_lock` which is a read/write semaphore maintained at the `mm_struct`
> > +level of granularity and, if CONFIG_PER_VMA_LOCK is set, a per-VMA lock at the
> > +VMA level of granularity.
> > +
> > +.. note::
> > +
> > +   Generally speaking, a read/write semaphore is a class of lock which permits
> > +   concurrent readers. However a write lock can only be obtained once all
> > +   readers have left the critical region (and pending readers made to wait).
> > +
> > +   This renders read locks on a read/write semaphore concurrent with other
> > +   readers and write locks exclusive against all others holding the semaphore.
> > +
> > +If CONFIG_PER_VMA_LOCK is not set, then things are relatively simple - a write
> > +mmap lock gives you exclusive write access to a VMA, and a read lock gives you
> > +concurrent read-only access.
> > +
> > +In the presence of CONFIG_PER_VMA_LOCK, i.e. VMA locks, things are more
> > +complicated. In this instance, a write semaphore is no longer enough to gain
> > +exclusive access to a VMA, a VMA write lock is also required.
> > +
> > +The VMA lock is implemented via the use of both a read/write semaphore and
> > +per-VMA and per-mm sequence numbers. We go into detail on this in the VMA lock
> > +internals section below, so for the time being it is important only to note that
> > +we can obtain either a VMA read or write lock.
> > +
> > +.. note::
> > +
> > +   VMAs under VMA **read** lock are obtained by the `lock_vma_under_rcu()`
> > +   function, and **no** existing mmap or VMA lock must be held, This function
>
> uffd_move_lock() calls lock_vma_under_rcu() after having already
> VMA-locked another VMA with uffd_lock_vma().

Oh uffd, how we love you...

I think it might be worth adding a note for this exception. Obviously they do
some pretty careful manipulation to avoid issues here so probably worth saying
'hey except uffd'

>
> > +   either returns a read-locked VMA, or NULL if the lock could not be
> > +   acquired. As the name suggests, the VMA will be acquired under RCU, though
> > +   once obtained, remains stable.
> > +   This kind of locking is entirely optimistic - if the lock is contended or a
> > +   competing write has started, then we do not obtain a read lock.
> > +
> > +   The `lock_vma_under_rcu()` function first calls `rcu_read_lock()` to ensure
> > +   that the VMA is acquired in an RCU critical section, then attempts to VMA
> > +   lock it via `vma_start_read()`, before releasing the RCU lock via
> > +   `rcu_read_unlock()`.
> > +
> > +   VMA read locks hold the a read lock on the `vma->vm_lock` semaphore for their
>
> nit: s/ the a / a /

Yeah Mike found the same thing, will fix.

>
> > +   duration and the caller of `lock_vma_under_rcu()` must release it via
> > +   `vma_end_read()`.
> > +
> > +   VMA **write** locks are acquired via `vma_start_write()` in instances where a
> > +   VMA is about to be modified, unlike `vma_start_read()` the lock is always
> > +   acquired. An mmap write lock **must** be held for the duration of the VMA
> > +   write lock, releasing or downgrading the mmap write lock also releases the
> > +   VMA write lock so there is no `vma_end_write()` function.
> > +
> > +   Note that a semaphore write lock is not held across a VMA lock. Rather, a
> > +   sequence number is used for serialisation, and the write semaphore is only
> > +   acquired at the point of write lock to update this (we explore this in detail
> > +   in the VMA lock internals section below).
> > +
> > +   This ensures the semantics we require - VMA write locks provide exclusive
> > +   write access to the VMA.
> > +
> > +Examining all valid lock state and what each implies:
> > +
> > +.. list-table::
> > +   :header-rows: 1
> > +
> > +   * - mmap lock
> > +     - VMA lock
> > +     - Stable?
> > +     - Can read safely?
> > +     - Can write safely?
> > +   * - \-
> > +     - \-
> > +     - N
> > +     - N
> > +     - N
> > +   * - R
> > +     - \-
> > +     - Y
> > +     - Y
> > +     - N
> > +   * - \-
> > +     - R
> > +     - Y
> > +     - Y
> > +     - N
> > +   * - W
> > +     - \-
> > +     - Y
> > +     - Y
> > +     - N
> > +   * - W
> > +     - W
> > +     - Y
> > +     - Y
> > +     - Y
> > +
> > +Note that there are some exceptions to this - the `anon_vma` field is permitted
> > +to be written to under mmap read lock and is instead serialised by the `struct
> > +mm_struct` field `page_table_lock`. In addition the `vm_mm` and all
>
> Hm, we really ought to add some smp_store_release() and READ_ONCE(),
> or something along those lines, around our ->anon_vma accesses...
> especially the "vma->anon_vma = anon_vma" assignment in
> __anon_vma_prepare() looks to me like, on architectures like arm64
> with write-write reordering, we could theoretically end up making a
> new anon_vma pointer visible to a concurrent page fault before the
> anon_vma has been initialized? Though I have no idea if that is
> practically possible, stuff would have to be reordered quite a bit for
> that to happen...

They make me nervous too, yes.
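
Agreed it'd probably want the publish/consume pairing made explicit,
something like the below - purely to illustrate the ordering you describe,
entirely untested and not a proposed patch:

        /* In __anon_vma_prepare(), once the anon_vma is fully set up: */
        smp_store_release(&vma->anon_vma, anon_vma);

        /*
         * Paired with an acquire (or READ_ONCE() as you say) on the
         * lockless reader side:
         */
        struct anon_vma *av = smp_load_acquire(&vma->anon_vma);

        if (av) {
                /* av's initialisation is guaranteed visible here. */
        }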

>
> > +lock-specific fields are permitted to be read under RCU alone  (though stability cannot
> > +be expected in this instance).
> > +
> > +.. note::
> > +   The most notable place to use the VMA read lock is on page table faults on
>
> s/page table faults/page faults/?
>

Ack will fix.

> > +   the x86-64 architecture, which importantly means that without a VMA write
>
> it's wired up to a bunch of architectures at this point - arm, arm64,
> powerpc, riscv, s390, x86 all use lock_vma_under_rcu().

Ah is it? Hadn't double checked that and clearly out of date, will update.

>
> > +   lock, page faults can race against you even if you hold an mmap write lock.
> > +
> > +VMA Fields
> > +----------
> > +
> > +We examine each field of the `struct vm_area_struct` type in detail in the table
> > +below.
> > +
> > +Reading of each field requires either an mmap read lock or a VMA read lock to be
> > +held, except where 'unstable RCU read' is specified, in which case unstable
> > +access to the field is permitted under RCU alone.
> > +
> > +The table specifies which write locks must be held to write to the field.
>
> vm_start, vm_end and vm_pgoff also require that the associated
> address_space and anon_vma (if applicable) are write-locked, and that
> their rbtrees are updated as needed.

Surely vm_flags too...

>
> > +.. list-table::
> > +   :widths: 20 10 22 5 20
> > +   :header-rows: 1
> > +
> > +   * - Field
> > +     - Config
> > +     - Description
> > +     - Unstable RCU read?
> > +     - Write Lock
> > +   * - vm_start
> > +     -
> > +     - Inclusive start virtual address of range VMA describes.
> > +     -
> > +     - mmap write, VMA write
> > +   * - vm_end
> > +     -
> > +     - Exclusive end virtual address of range VMA describes.
> > +     -
> > +     - mmap write, VMA write
> > +   * - vm_rcu
> > +     - vma lock
> > +     - RCU list head, in union with vma_start, vma_end. RCU implementation detail.
> > +     - N/A
> > +     - N/A
> > +   * - vm_mm
> > +     -
> > +     - Containing mm_struct.
> > +     - Y
> > +     - (Static)
> > +   * - vm_page_prot
> > +     -
> > +     - Architecture-specific page table protection bits determined from VMA
> > +       flags
> > +     -
> > +     - mmap write, VMA write
> > +   * - vm_flags
> > +     -
> > +     - Read-only access to VMA flags describing attributes of VMA, in union with
> > +       private writable `__vm_flags`.
> > +     -
> > +     - N/A
> > +   * - __vm_flags
> > +     -
> > +     - Private, writable access to VMA flags field, updated by vm_flags_*()
> > +       functions.
> > +     -
> > +     - mmap write, VMA write
> > +   * - detached
> > +     - vma lock
> > +     - VMA lock implementation detail - indicates whether the VMA has been
> > +       detached from the tree.
> > +     - Y
> > +     - mmap write, VMA write
> > +   * - vm_lock_seq
> > +     - vma lock
> > +     - VMA lock implementation detail - A sequence number used to serialise the
> > +       VMA lock, see the VMA lock section below.
> > +     - Y
> > +     - mmap write, VMA write
>
> I think "mmap write" is accurate, but "VMA write" is inaccurate -
> you'd need to have already written to the vm_lock_seq in order to have
> a VMA write lock.

Yes my mistake, will correct!

>
> > +   * - vm_lock
> > +     - vma lock
> > +     - VMA lock implementation detail - A pointer to the VMA lock read/write
> > +       semaphore.
> > +     - Y
> > +     - None required
> > +   * - shared.rb
> > +     -
> > +     - A red/black tree node used, if the mapping is file-backed, to place the
> > +       VMA in the `struct address_space->i_mmap` red/black interval tree.
> > +     -
> > +     - mmap write, VMA write, i_mmap write
> > +   * - shared.rb_subtree_last
> > +     -
> > +     - Metadata used for management of the interval tree if the VMA is
> > +       file-backed.
> > +     -
> > +     - mmap write, VMA write, i_mmap write
> > +   * - anon_vma_chain
> > +     -
> > +     - List of links to forked/CoW'd `anon_vma` objects.
> > +     -
> > +     - mmap read or above, anon_vma write lock
> > +   * - anon_vma
> > +     -
> > +     - `anon_vma` object used by anonymous folios mapped exclusively to this VMA.
> > +     -
> > +     - mmap read or above, page_table_lock
> > +   * - vm_ops
> > +     -
> > +     - If the VMA is file-backed, then either the driver or file-system provides
> > +       a `struct vm_operations_struct` object describing callbacks to be invoked
> > +       on specific VMA lifetime events.
> > +     -
> > +     - (Static)
> > +   * - vm_pgoff
> > +     -
> > +     - Describes the page offset into the file, the original page offset within
> > +       the virtual address space (prior to any `mremap()`), or PFN if a PFN map.
>
> Ooh, right, I had forgotten about this quirk, and I think I never
> fully understood these rules... it's a PFN if the VMA is
> private+maywrite+pfnmap. And the vma->vm_pgoff is set in
> remap_pfn_range_internal() under those conditions.

Yeah it's horrid. The whole mremap() hack that makes it the 'original' virtual
page offset on mmap() but doesn't update it afterwards is equally quite horrid.

>
> Huh, so for example, if you are in an environment where usbdev_mmap()
> uses remap_pfn_range() (which depends on hardware - it seems to work
> inside QEMU but not on real machine), and you have at least read
> access to a device at /dev/bus/usb/*/* (which are normally
> world-readable), you can actually do this:
>
> user@vm:/tmp$ cat usb-get-physaddr.c
> #include <err.h>
> #include <stdlib.h>
> #include <fcntl.h>
> #include <sys/mman.h>
> #define SYSCHK(x) ({          \
>   typeof(x) __res = (x);      \
>   if (__res == (typeof(x))-1) \
>     err(1, "SYSCHK(" #x ")"); \
>   __res;                      \
> })
> int main(int argc, char **argv) {
>   if (argc != 2)
>     errx(1, "expect one argument (usbdev path)");
>   int fd = SYSCHK(open(argv[1], O_RDONLY));
>   SYSCHK(mmap((void*)0x10000, 0x1000, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED_NOREPLACE, fd, 0));
>   system("head -n1 /proc/$PPID/maps");
> }
> user@vm:/tmp$ gcc -o usb-get-physaddr usb-get-physaddr.c
> user@vm:/tmp$ ./usb-get-physaddr /dev/bus/usb/001/001
> 00010000-00011000 rw-p 0103f000 00:06 135
>   /dev/bus/usb/001/001
> user@vm:/tmp$ ./usb-get-physaddr /dev/bus/usb/001/001
> 00010000-00011000 rw-p 0103f000 00:06 135
>   /dev/bus/usb/001/001
> user@vm:/tmp$ ./usb-get-physaddr /dev/bus/usb/001/001
> 00010000-00011000 rw-p 0107e000 00:06 135
>   /dev/bus/usb/001/001
> user@vm:/tmp$ ./usb-get-physaddr /dev/bus/usb/001/001
> 00010000-00011000 rw-p 010bd000 00:06 135
>   /dev/bus/usb/001/001
> user@vm:/tmp$
>
> and see physical addresses in the offset field in /proc/*/maps...
> that's not great. And pointless on architectures with
> CONFIG_ARCH_HAS_PTE_SPECIAL, from what I can tell.

Yeah, vm_normal_page() has a nice comment on this insanity.

Actually I'll update this to specify that just to be clear.

>
>
> > +     -
> > +     - mmap write, VMA write
> > +   * - vm_file
> > +     -
> > +     - If the VMA is file-backed, points to a `struct file` object describing
> > +       the underlying file, if anonymous then `NULL`.
> > +     -
> > +     - (Static)
> > +   * - vm_private_data
> > +     -
> > +     - A `void *` field for driver-specific metadata.
> > +     -
> > +     - Driver-mandated.
> > +   * - anon_name
> > +     - anon name
> > +     - A field for storing a `struct anon_vma_name` object providing a name for
> > +       anonymous mappings, or `NULL` if none is set or the VMA is file-backed.
> > +     -
> > +     - mmap write, VMA write
> > +   * - swap_readahead_info
> > +     - swap
> > +     - Metadata used by the swap mechanism to perform readahead.
> > +     -
> > +     - mmap read
> > +   * - vm_region
> > +     - nommu
> > +     - The containing region for the VMA for architectures which do not
> > +       possess an MMU.
> > +     - N/A
> > +     - N/A
> > +   * - vm_policy
> > +     - numa
> > +     - `mempolicy` object which describes NUMA behaviour of the VMA.
> > +     -
> > +     - mmap write, VMA write
> > +   * - numab_state
> > +     - numab
> > +     - `vma_numab_state` object which describes the current state of NUMA
> > +       balancing in relation to this VMA.
> > +     -
> > +     - mmap write, VMA write
>
> I think task_numa_work() is only holding the mmap lock in read mode
> when it sets this pointer to a non-NULL value.

ugh lord... knew I'd get at least one of these wrong :P

Yeah you're right, will fix!

>
> > +   * - vm_userfaultfd_ctx
> > +     -
> > +     - Userfaultfd context wrapper object of type `vm_userfaultfd_ctx`, either
> > +       of zero size if userfaultfd is disabled, or containing a pointer to an
> > +       underlying `userfaultfd_ctx` object which describes userfaultfd metadata.
> > +     -
> > +     - mmap write, VMA write
> > +
> > +.. note::
> > +
> > +   In the config column 'vma lock' configuration means CONFIG_PER_VMA_LOCK,
> > +   'anon name' means CONFIG_ANON_VMA_NAME, 'swap' means CONFIG_SWAP, 'nommu'
> > +   means that CONFIG_MMU is not set, 'numa' means CONFIG_NUMA and 'numab' means
> > +   CONFIG_NUMA_BALANCING'.
> > +
> > +   In the write lock column '(Static)' means that the field is set only once
> > +   upon initialisation of the VMA and not changed after this, the VMA would
> > +   either have been under an mmap write and VMA write lock at the time or not
> > +   yet inserted into any tree.
> > +
> > +Page table locks
> > +----------------
> > +
> > +When allocating a P4D, PUD or PMD and setting the relevant entry in the above
> > +PGD, P4D or PUD, the `mm->page_table_lock` is acquired to do so. This is
> > +acquired in `__p4d_alloc()`, `__pud_alloc()` and `__pmd_alloc()` respectively.
> > +
> > +.. note::
> > +   `__pmd_alloc()` actually invokes `pud_lock()` and `pud_lockptr()` in turn,
> > +   however at the time of writing it ultimately references the
> > +   `mm->page_table_lock`.
> > +
> > +Allocating a PTE will either use the `mm->page_table_lock` or, if
> > +`USE_SPLIT_PMD_PTLOCKS` is defined, used a lock embedded in the PMD physical
> > +page metadata in the form of a `struct ptdesc`, acquired by `pmd_ptdesc()`
> > +called from `pmd_lock()` and ultimately `__pte_alloc()`.
> >+
> > +Finally, modifying the contents of the PTE has special treatment, as this is a
> > +lock that we must acquire whenever we want stable and exclusive access to
> > +entries pointing to data pages within a PTE, especially when we wish to modify
> > +them.
>
> I guess one other perspective on this would be to focus on the
> circumstances under which you're allowed to write entries:
>
> 0. page tables can be concurrently read by hardware and GUP-fast, so
> writes must always be appropriately atomic

Yeah, I definitely need to mention GUP-fast considerations (and consequently
the pXX_lockless..() functions). Thanks for raising that, it's an important one.
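
Something like this rough, from-memory sketch of the GUP-fast style pattern
(not the real gup_fast() code) might be worth including to show why writes
have to be suitably atomic:

/*
 * Rough sketch of a GUP-fast style lockless reader: no mmap lock, no VMA
 * lock, no page table lock. It reads the entry with the lockless helper,
 * does its work, then re-reads and compares to detect a racing
 * modification, bailing to the locked slow path if anything changed.
 */
static bool lockless_pte_read_sketch(pte_t *ptep)
{
        pte_t pte = ptep_get_lockless(ptep);

        if (pte_none(pte) || !pte_present(pte))
                return false;           /* take the slow path */

        /* ... grab a reference on the folio here ... */

        if (pte_val(pte) != pte_val(ptep_get(ptep)))
                return false;           /* raced, retry via slow path */

        return true;
}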

> 1. changing a page table entry always requires locking the containing
> page table (except when the write is an A/D update by hardware)

I think we can ignore the hardware writes themselves, though it's worth
adding a 'note' to explain that these can happen outside of this framework
altogether.

> 2. in page tables higher than PMD level, page table entries that point
> to page tables can only be changed to point to something else when
> holding all the relevant high-level locks leading to the VMA in
> exclusive mode: mmap lock (unless the VMA is detached), VMA lock,
> anon_vma, address_space

Right, this seems mremap()-specific when you say 'change' here :) and of
course we have code that explicitly does this (take_rmap_locks() +
drop_rmap_locks()).
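
For reference, those amount to roughly the following (paraphrased from
memory - see mm/mremap.c for the real thing):

/* Paraphrased from mm/mremap.c - take both rmap locks so that rmap
 * walkers cannot observe the range while entries are being moved. */
static void take_rmap_locks_sketch(struct vm_area_struct *vma)
{
        if (vma->vm_file)
                i_mmap_lock_write(vma->vm_file->f_mapping);
        if (vma->anon_vma)
                anon_vma_lock_write(vma->anon_vma);
}

static void drop_rmap_locks_sketch(struct vm_area_struct *vma)
{
        if (vma->anon_vma)
                anon_vma_unlock_write(vma->anon_vma);
        if (vma->vm_file)
                i_mmap_unlock_write(vma->vm_file->f_mapping);
}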

> 3. PMD entries that point to page tables can be changed while holding
> the page table spinlocks for the entry and the table it points to

Hm wut? When you say 'entry', what do you mean? Obviously a page table could,
in theory, be changed at any point at which you don't have it locked, and to
be sure it hasn't been you have to lock it and check again.

> 4. lowest-level page tables can be in high memory, so they must be
> kmapped for access, and pte_offset_map_lock() does that for you

I kind of don't really like to bother talking about 32-bit kernels (or at
least 32-bit kernels that have to use high memory) as I consider them
completely deprecated ;)

Might be worth the briefest of brief mentions...

> 5. entries in "none" state can only be populated with leaf entries
> while holding the mmap or vma lock (doing it through the rmap would be
> bad because that could race with munmap() zapping data pages in the
> region)
> 6. leaf entries can be zapped (changed to "none") while holding any
> one of mmap lock, vma lock, address_space lock, or anon_vma lock

For both 5 and 6 - I'm not sure we ever zap without holding the mmap
lock, do we?

Unless you're including folio_mkclean() and pfn_mkclean_range()? I guess
this is 'strike of the linux kernel terminology' once again :P

Yeah in that case sure.

OK, so interestingly this really aligns with what Alice said about this not
giving a clear indicator, from a user's perspective, of 'what lock do I
need to hold'.

So I will absolutely address all this and try to get the fundamentals
boiled down.

Also, obviously the exception to your rules is the _freeing_ of higher level
page tables, because we assume we are in a state where nothing can access
them, so no such locks are required. But I cover that below.

>
> And then the rules for readers mostly follow from that:
> 1 => holding the appropriate page table lock makes the contents of a
> page table stable, except for A/D updates
> 2 => page table entries higher than PMD level that point to lower page
> tables can be followed without taking page table locks

Yeah, this is true actually - it might be worth mentioning page table walkers
here and how they operate, as they're instructive on page table locking
requirements.

> 3+4 => following PMD entries pointing to page tables requires careful
> locking, and pte_offset_map_lock() does that for you

Well, pte_offset_map_lock() is taken at the PTE level, right?

pmd_lock() at the PMD level (and pud_lock() ostensibly at the PUD level, but
that amounts to mm->page_table_lock anyway).
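
i.e. the pattern is something like this (minimal sketch, error handling
trimmed):

/* Minimal sketch: take the PTE-level lock for a single address, letting
 * pte_offset_map_lock() detect a PMD that changed underneath us. */
static int with_pte_locked_sketch(struct mm_struct *mm, pmd_t *pmd,
                                  unsigned long addr)
{
        spinlock_t *ptl;
        pte_t *pte = pte_offset_map_lock(mm, pmd, addr, &ptl);

        if (!pte)
                return -EAGAIN; /* the PMD changed, caller should retry */

        if (pte_none(ptep_get(pte))) {
                /* ... populate, or skip ... */
        }

        pte_unmap_unlock(pte, ptl);
        return 0;
}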

>
> Ah, though now I see the page table teardown section below already has
> some of this information.
>
> > +This is performed via `pte_offset_map_lock()` which carefully checks to ensure
> > +that the PTE hasn't changed from under us, ultimately invoking `pte_lockptr()`
> > +to obtain a spin lock at PTE granularity contained within the `struct ptdesc`
> > +associated with the physical PTE page. The lock must be released via
> > +`pte_unmap_unlock()`.
>
> Sidenote: Not your fault that the Linux terminology for this sucks,
> but the way this section uses "PTE" to describe a page table rather
> than a Page Table Entry is pretty confusing to me... in my head, a
> pte_t is a Page Table Entry (PTE), a pte_t* is a Page Table or a Page
> Table Entry Pointer (depending on context), a pmd_t is a Page Middle
> Directory Entry, and a pmd_t* is a Page Middle Directory or a Page
> Middle Directory Entry Pointer. (Though to make things easier I
> normally think of them as L1 entry, L1 table, L2 entry, L2 table.)

I actually wanted at some point to change the naming in the kernel to be
consistent (though it'd be huge churn... but hey, that's my speciality,
right? ;)

The fact that 'PTE' means both the page table itself and a Page Table Entry
within it is extremely shit, yes.

>
> > +.. note::
> > +   There are some variants on this, such as `pte_offset_map_rw_nolock()` when we
> > +   know we hold the PTE stable but for brevity we do not explore this.
> > +   See the comment for `__pte_offset_map_lock()` for more details.
> > +
> > +When modifying data in ranges we typically only wish to allocate higher page
> > +tables as necessary, using these locks to avoid races or overwriting anything,
> > +and set/clear data at the PTE level as required (for instance when page faulting
> > +or zapping).
> > +
> > +Page table teardown
> > +-------------------
> > +
> > +Tearing down page tables themselves is something that requires significant
> > +care. There must be no way that page tables designated for removal can be
> > +traversed or referenced by concurrent tasks.
>
> (except by hardware or with gup_fast() which behaves roughly like a
> hardware page walker and completely ignores what is happening at the
> VMA layer)

Yeah I definitely need to address the gup_fast() stuff. Will do so in
teardown section here too.

>
> > +It is insufficient to simply hold an mmap write lock and VMA lock (which will
> > +prevent racing faults, and rmap operations), as a file-backed mapping can be
> > +truncated under the `struct address_space` i_mmap_lock alone.
> > +
> > +As a result, no VMA which can be accessed via the reverse mapping (either
> > +anon_vma or the `struct address_space->i_mmap` interval tree) can have its page
> > +tables torn down.
>
> (except last-level page tables: khugepaged already deletes those for
> file mappings without using the mmap lock at all in
> retract_page_tables(), and there is a pending series that will do the
> same with page tables in other VMAs too, see
> <https://lore.kernel.org/all/cover.1729157502.git.zhengqi.arch@bytedance.com/>)

Ugh wut OK haha. Will look into this.

>
> > +The operation is typically performed via `free_pgtables()`, which assumes either
> > +the mmap write lock has been taken (as specified by its `mm_wr_locked`
> > +parameter), or that it the VMA is fully detached.
>
> nit: s/that it the/that the/

Ack will fix.

>
> > +It carefully removes the VMA from all reverse mappings, however it's important
> > +that no new ones overlap these or any route remain to permit access to addresses
> > +within the range whose page tables are being torn down.
> > +
> > +As a result of these careful conditions, note that page table entries are
> > +cleared without page table locks, as it is assumed that all of these precautions
> > +have already been taken.
>
> Oh, I didn't realize this... interesting.
>
> > +mmap write lock downgrading
> > +---------------------------
> > +
> > +While it is possible to obtain an mmap write or read lock using the
> > +`mm->mmap_lock` read/write semaphore, it is also possible to **downgrade** from
> > +a write lock to a read lock via `mmap_write_downgrade()`.
> > +
> > +Similar to `mmap_write_unlock()`, this implicitly terminates all VMA write locks
> > +via `vma_end_write_all()` (more or this behaviour in the VMA lock internals
>
> typo: s/or/on/

Ack will fix.

>
> > +section below), but importantly does not relinquish the mmap lock while
> > +downgrading, therefore keeping the locked virtual address space stable.
> > +
> > +A subtlety here is that callers can assume, if they invoke an
> > +mmap_write_downgrade() operation, that they still have exclusive access to the
> > +virtual address space (excluding VMA read lock holders), as for another task to
> > +have downgraded they would have had to have exclusive access to the semaphore
> > +which can't be the case until the current task completes what it is doing.
> > +
> > +Stack expansion
> > +---------------
> > +
> > +Stack expansion throws up additional complexities in that we cannot permit there
> > +to be racing page faults, as a result we invoke `vma_start_write()` to prevent
> > +this in `expand_downwards()` or `expand_upwards()`.
>
> And this needs the mmap lock in write mode, so stack expansion is only
> done in codepaths where we can reliably get that - so it happens on
> fault handling, but not on GUP. This probably creates the fun quirk
> that, in theory, the following scenario could happen:
>
> 1. a userspace program creates a large on-stack buffer (which exceeds
> the bounds of the current stack VMA but is within the stack size
> limit)
> 2. userspace calls something like the read() syscall on this buffer
> (without writing to any deeper part of the stack - so this can't
> happen when you call into a non-inlined library function for read() on
> x86, but it might happen on arm64, where a function call does not
> require writing to the stack)
> 3. the kernel read() handler is trying to do something like direct I/O
> and uses GUP to pin the user-supplied pages (and does not use
> copy_to_user(), which would be more common)
> 4. GUP fails, the read() fails
>
> But this was probably the least bad option to deal with existing stack
> expansion issues.

Hm, that seems like a just-so set of circumstances though :P

I was around for that whole 'mmap read lock only needed for stack
expansion' series and discussion and it was all very very horrible.

This is definitely the least bad option at least for now.

Funny thing is, examining that code led me to the patch I sent for
eliminating the additional locking... as I found with the book, staring at
code for the purposes of explaining it naturally leads to patches :)

>
> > +Lock ordering
> > +-------------
> > +
> > +As we have multiple locks across the kernel which may or may not be taken at the
> > +same time as explicit mm or VMA locks, we have to be wary of lock inversion, and
> > +the **order** in which locks are acquired and released becomes very important.
> > +
> > +.. note::
> > +
> > +   Lock inversion occurs when two threads need to acquire multiple locks,
> > +   but in doing so inadvertently cause a mutual deadlock.
> > +
> > +   For example, consider thread 1 which holds lock A and tries to acquire lock B,
> > +   while thread 2 holds lock B and tries to acquire lock A.
> > +
> > +   Both threads are now deadlocked on each other. However, had they attempted to
> > +   acquire locks in the same order, one would have waited for the other to
> > +   complete its work and no deadlock would have occurred.
> > +
> > +The opening comment in `mm/rmap.c` describes in detail the required ordering of
> > +locks within memory management code:
> > +
> > +.. code-block::
> > +
> > +  inode->i_rwsem       (while writing or truncating, not reading or faulting)
> > +    mm->mmap_lock
> > +      mapping->invalidate_lock (in filemap_fault)
> > +        folio_lock
> > +          hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
> > +            vma_start_write
> > +              mapping->i_mmap_rwsem
> > +                anon_vma->rwsem
> > +                  mm->page_table_lock or pte_lock
> > +                    swap_lock (in swap_duplicate, swap_info_get)
> > +                      mmlist_lock (in mmput, drain_mmlist and others)
> > +                      mapping->private_lock (in block_dirty_folio)
> > +                          i_pages lock (widely used)
> > +                            lruvec->lru_lock (in folio_lruvec_lock_irq)
> > +                      inode->i_lock (in set_page_dirty's __mark_inode_dirty)
> > +                      bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
> > +                        sb_lock (within inode_lock in fs/fs-writeback.c)
> > +                        i_pages lock (widely used, in set_page_dirty,
> > +                                  in arch-dependent flush_dcache_mmap_lock,
> > +                                  within bdi.wb->list_lock in __sync_single_inode)
> > +
> > +Please check the current state of this comment which may have changed since the
> > +time of writing of this document.
>
> I think something like
> https://www.kernel.org/doc/html/latest/doc-guide/kernel-doc.html#overview-documentation-comments
> is supposed to let you include the current version of the comment into
> the rendered documentation HTML without having to manually keep things
> in sync. I've never used that myself, but there are a bunch of
> examples in the tree; for example, grep for "DMA fences overview".

Ah, but this isn't a kernel-doc comment, it's just a raw comment :) so I'm not
sure there is a great way of grabbing just that, reliably. Maybe I can turn it
into a kernel-doc comment in a follow-up patch or something?
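
If we did, it'd be something like the below in mm/rmap.c (hypothetical
'DOC:' section name), which a ".. kernel-doc:: mm/rmap.c" directive with
":doc: mm lock ordering" could then pull straight into the rst:

/**
 * DOC: mm lock ordering
 *
 * inode->i_rwsem       (while writing or truncating, not reading or faulting)
 *   mm->mmap_lock
 *     mapping->invalidate_lock (in filemap_fault)
 *       ...
 */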


Thanks for the review, I very much appreciate you taking the time to do this. I
will update and send out a v2 with your and others' suggested changes soon.

I think I'll keep it RFC for now, until it settles a bit and we agree the
details are right, as I strongly feel this document is of critical
importance at this stage, especially in light of the Rust people needing this
kind of detail.
Lorenzo Stoakes Nov. 4, 2024, 4:52 p.m. UTC | #12
+cc Suren, linux-doc who I mistakenly didn't cc in first email!

On Mon, Nov 04, 2024 at 03:47:56PM +0100, Alice Ryhl wrote:
> On Fri, Nov 1, 2024 at 7:50 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > Locking around VMAs is complicated and confusing. While we have a number of
> > disparate comments scattered around the place, we seem to be reaching a
> > level of complexity that justifies a serious effort at clearly documenting
> > how locks are expected to be interacted with when it comes to interacting
> > with mm_struct and vm_area_struct objects.
> >
> > This is especially pertinent as regards efforts to find sensible
> > abstractions for these fundamental objects within the kernel rust
> > abstraction whose compiler strictly requires some means of expressing these
> > rules (and through this expression can help self-document these
> > requirements as well as enforce them which is an exciting concept).
> >
> > The document limits scope to mmap and VMA locks and those that are
> > immediately adjacent and relevant to them - so additionally covers page
> > table locking as this is so very closely tied to VMA operations (and relies
> > upon us handling these correctly).
> >
> > The document tries to cover some of the nastier and more confusing edge
> > cases and concerns especially around lock ordering and page table teardown.
> >
> > The document also provides some VMA lock internals, which are up to date
> > and inclusive of recent changes to recent sequence number changes.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> [...]
>
> > +Page table locks
> > +----------------
> > +
> > +When allocating a P4D, PUD or PMD and setting the relevant entry in the above
> > +PGD, P4D or PUD, the `mm->page_table_lock` is acquired to do so. This is
> > +acquired in `__p4d_alloc()`, `__pud_alloc()` and `__pmd_alloc()` respectively.
> > +
> > +.. note::
> > +   `__pmd_alloc()` actually invokes `pud_lock()` and `pud_lockptr()` in turn,
> > +   however at the time of writing it ultimately references the
> > +   `mm->page_table_lock`.
> > +
> > +Allocating a PTE will either use the `mm->page_table_lock` or, if
> > +`USE_SPLIT_PMD_PTLOCKS` is defined, used a lock embedded in the PMD physical
> > +page metadata in the form of a `struct ptdesc`, acquired by `pmd_ptdesc()`
> > +called from `pmd_lock()` and ultimately `__pte_alloc()`.
> > +
> > +Finally, modifying the contents of the PTE has special treatment, as this is a
> > +lock that we must acquire whenever we want stable and exclusive access to
> > +entries pointing to data pages within a PTE, especially when we wish to modify
> > +them.
> > +
> > +This is performed via `pte_offset_map_lock()` which carefully checks to ensure
> > +that the PTE hasn't changed from under us, ultimately invoking `pte_lockptr()`
> > +to obtain a spin lock at PTE granularity contained within the `struct ptdesc`
> > +associated with the physical PTE page. The lock must be released via
> > +`pte_unmap_unlock()`.
> > +
> > +.. note::
> > +   There are some variants on this, such as `pte_offset_map_rw_nolock()` when we
> > +   know we hold the PTE stable but for brevity we do not explore this.
> > +   See the comment for `__pte_offset_map_lock()` for more details.
> > +
> > +When modifying data in ranges we typically only wish to allocate higher page
> > +tables as necessary, using these locks to avoid races or overwriting anything,
> > +and set/clear data at the PTE level as required (for instance when page faulting
> > +or zapping).
>
> Speaking as someone who doesn't know the internals at all ... this
> section doesn't really answer any questions I have about the page
> table. It looks like this could use an initial section about basic
> usage, and the detailed information could come after? Concretely, if I
> wish to call vm_insert_page or zap some pages, what are the locking
> requirements? What if I'm writing a page fault handler?

Ack, totally agree - I think we need this document to serve two purposes:
the first is to go over, in detail, the locking requirements from an mm dev's
point of view with an internals focus, and the second is to give those
outside mm this kind of information.

It's good to get insight from an outside perspective, as inevitably we mm
devs lose sight of the wood for the trees when it comes to internals
vs. the practical needs of those who make use of mm in one respect or another.

So this kind of feedback is very helpful and welcome :) TL;DR - yes I will
explicitly state what is required for various operations on the respin.
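
To give a flavour of the kind of thing I mean for a driver fault handler
(entirely hypothetical driver, purely illustrative): by the time the handler
is invoked, the fault path already holds the VMA read lock (or the mmap read
lock on the fallback path), so the driver needs no extra VMA-level locking of
its own:

/* Hypothetical driver fault handler: the page fault path already holds
 * the VMA read lock (or the mmap read lock on the fallback path), so no
 * additional VMA-level locking is required here. */
static vm_fault_t my_drv_fault(struct vm_fault *vmf)
{
        struct my_drv_buf *buf = vmf->vma->vm_private_data;

        if (vmf->pgoff >= buf->nr_pages)
                return VM_FAULT_SIGBUS;

        return vmf_insert_page(vmf->vma, vmf->address,
                               buf->pages[vmf->pgoff]);
}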

>
> Alice

As a wordy aside, a large part of the motivation of this document, or
certainly my prioritisation of it, is explicitly to help the rust team
correctly abstract this aspect of mm.

The other part is to help the mm team, that is especially myself, correctly
understand and _remember_ the numerous painful ins and outs of this stuff,
much of which has been pertinent of late for not wonderfully positive
reasons.

Hopefully we accomplish both! :>)
Suren Baghdasaryan Nov. 4, 2024, 5:01 p.m. UTC | #13
On Fri, Nov 1, 2024 at 11:51 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> Locking around VMAs is complicated and confusing. While we have a number of
> disparate comments scattered around the place, we seem to be reaching a
> level of complexity that justifies a serious effort at clearly documenting
> how locks are expected to be interacted with when it comes to interacting
> with mm_struct and vm_area_struct objects.
>
> This is especially pertinent as regards efforts to find sensible
> abstractions for these fundamental objects within the kernel rust
> abstraction whose compiler strictly requires some means of expressing these
> rules (and through this expression can help self-document these
> requirements as well as enforce them which is an exciting concept).
>
> The document limits scope to mmap and VMA locks and those that are
> immediately adjacent and relevant to them - so additionally covers page
> table locking as this is so very closely tied to VMA operations (and relies
> upon us handling these correctly).
>
> The document tries to cover some of the nastier and more confusing edge
> cases and concerns especially around lock ordering and page table teardown.
>
> The document also provides some VMA lock internals, which are up to date
> and inclusive of recent changes to recent sequence number changes.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Thanks for documenting this, Lorenzo!
Just heads-up, I'm working on changing some of the implementation
details (removing vma->detached, moving vm_lock into vm_area_struct,
etc.). I should be able to post those changes sometime later this week
if testing does not reveal any issues.

> ---
>
> REVIEWERS NOTES:
>    You can speed up doc builds by running `make SPHINXDIRS=mm htmldocs`. I
>    also uploaded a copy of this to my website at
>    https://ljs.io/output/mm/vma_locks to make it easier to have a quick
>    read through. Thanks!
>
>
>  Documentation/mm/index.rst     |   1 +
>  Documentation/mm/vma_locks.rst | 527 +++++++++++++++++++++++++++++++++
>  2 files changed, 528 insertions(+)
>  create mode 100644 Documentation/mm/vma_locks.rst
>
> diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
> index 0be1c7503a01..da5f30acaca5 100644
> --- a/Documentation/mm/index.rst
> +++ b/Documentation/mm/index.rst
> @@ -64,3 +64,4 @@ documentation, or deleted if it has served its purpose.
>     vmemmap_dedup
>     z3fold
>     zsmalloc
> +   vma_locks
> diff --git a/Documentation/mm/vma_locks.rst b/Documentation/mm/vma_locks.rst
> new file mode 100644
> index 000000000000..52b9d484376a
> --- /dev/null
> +++ b/Documentation/mm/vma_locks.rst
> @@ -0,0 +1,527 @@
> +VMA Locking
> +===========
> +
> +Overview
> +--------
> +
> +Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
> +'VMA's of type `struct vm_area_struct`.
> +
> +Each VMA describes a virtually contiguous memory range with identical
> +attributes, each of which described by a `struct vm_area_struct`
> +object. Userland access outside of VMAs is invalid except in the case where an
> +adjacent stack VMA could be extended to contain the accessed address.
> +
> +All VMAs are contained within one and only one virtual address space, described
> +by a `struct mm_struct` object which is referenced by all tasks (that is,
> +threads) which share the virtual address space. We refer to this as the `mm`.
> +
> +Each mm object contains a maple tree data structure which describes all VMAs
> +within the virtual address space.
> +
> +The kernel is designed to be highly scalable against concurrent access to
> +userland memory, so a complicated set of locks are required to ensure no data
> +races or memory corruption occurs.
> +
> +This document explores this locking in detail.
> +
> +.. note::
> +
> +   There are three different things that a user might want to achieve via
> +   locks - the first of which is **stability**. That is - ensuring that the VMA
> +   won't be freed or modified in any way from underneath us.
> +
> +   All MM and VMA locks ensure stability.
> +
> +   Secondly we have locks which allow **reads** but not writes (and which might
> +   be held concurrent with other CPUs who also hold the read lock).
> +
> +   Finally, we have locks which permit exclusive access to the VMA to allow for
> +   **writes** to the VMA.
> +
> +MM and VMA locks
> +----------------
> +
> +There are two key classes of lock utilised when reading and manipulating VMAs -
> +the `mmap_lock` which is a read/write semaphore maintained at the `mm_struct`
> +level of granularity and, if CONFIG_PER_VMA_LOCK is set, a per-VMA lock at the
> +VMA level of granularity.
> +
> +.. note::
> +
> +   Generally speaking, a read/write semaphore is a class of lock which permits
> +   concurrent readers. However a write lock can only be obtained once all
> +   readers have left the critical region (and pending readers made to wait).
> +
> +   This renders read locks on a read/write semaphore concurrent with other
> +   readers and write locks exclusive against all others holding the semaphore.
> +
> +If CONFIG_PER_VMA_LOCK is not set, then things are relatively simple - a write
> +mmap lock gives you exclusive write access to a VMA, and a read lock gives you
> +concurrent read-only access.
> +
> +In the presence of CONFIG_PER_VMA_LOCK, i.e. VMA locks, things are more
> +complicated. In this instance, a write semaphore is no longer enough to gain
> +exclusive access to a VMA, a VMA write lock is also required.

I think "exclusive access to a VMA" should be "exclusive access to mm"
if you are talking about mmap_lock.

I think it's worth adding here:
1. to take a VMA write-lock you need to be holding an mmap_lock;
2. write-unlocking mmap_lock drops all VMA write locks in that mm.

I see that you touch on this in the below "Note" section but that's
not an implementation detail but the designed behavior, so I think
these should not be mere side-notes.

> +
> +The VMA lock is implemented via the use of both a read/write semaphore and
> +per-VMA and per-mm sequence numbers. We go into detail on this in the VMA lock
> +internals section below, so for the time being it is important only to note that
> +we can obtain either a VMA read or write lock.
> +
> +.. note::
> +
> +   VMAs under VMA **read** lock are obtained by the `lock_vma_under_rcu()`
> +   function, and **no** existing mmap or VMA lock must be held, This function

"no existing mmap or VMA lock must be held" did you mean to say "no
exclusive mmap or VMA locks must be held"? Because one can certainly
hold a read-lock on them.

> +   either returns a read-locked VMA, or NULL if the lock could not be
> +   acquired. As the name suggests, the VMA will be acquired under RCU, though
> +   once obtained, remains stable.
> +
> +   This kind of locking is entirely optimistic - if the lock is contended or a
> +   competing write has started, then we do not obtain a read lock.
> +
> +   The `lock_vma_under_rcu()` function first calls `rcu_read_lock()` to ensure
> +   that the VMA is acquired in an RCU critical section, then attempts to VMA
> +   lock it via `vma_start_read()`, before releasing the RCU lock via
> +   `rcu_read_unlock()`.
> +
> +   VMA read locks hold the a read lock on the `vma->vm_lock` semaphore for their
> +   duration and the caller of `lock_vma_under_rcu()` must release it via
> +   `vma_end_read()`.
> +
> +   VMA **write** locks are acquired via `vma_start_write()` in instances where a
> +   VMA is about to be modified, unlike `vma_start_read()` the lock is always
> +   acquired. An mmap write lock **must** be held for the duration of the VMA
> +   write lock, releasing or downgrading the mmap write lock also releases the
> +   VMA write lock so there is no `vma_end_write()` function.
> +
> +   Note that a semaphore write lock is not held across a VMA lock. Rather, a
> +   sequence number is used for serialisation, and the write semaphore is only
> +   acquired at the point of write lock to update this (we explore this in detail
> +   in the VMA lock internals section below).
> +
> +   This ensures the semantics we require - VMA write locks provide exclusive
> +   write access to the VMA.
> +
> +Examining all valid lock state and what each implies:
> +
> +.. list-table::
> +   :header-rows: 1
> +
> +   * - mmap lock
> +     - VMA lock
> +     - Stable?
> +     - Can read safely?
> +     - Can write safely?
> +   * - \-
> +     - \-
> +     - N
> +     - N
> +     - N
> +   * - R
> +     - \-
> +     - Y
> +     - Y
> +     - N
> +   * - \-
> +     - R
> +     - Y
> +     - Y
> +     - N
> +   * - W
> +     - \-
> +     - Y
> +     - Y
> +     - N
> +   * - W
> +     - W
> +     - Y
> +     - Y
> +     - Y
> +
> +Note that there are some exceptions to this - the `anon_vma` field is permitted
> +to be written to under mmap read lock and is instead serialised by the `struct
> +mm_struct` field `page_table_lock`. In addition the `vm_mm` and all
> +lock-specific fields are permitted to be read under RCU alone  (though stability cannot
> +be expected in this instance).
> +
> +.. note::
> +   The most notable place to use the VMA read lock is on page table faults on
> +   the x86-64 architecture, which importantly means that without a VMA write

As Jann mentioned, CONFIG_PER_VMA_LOCK is supported on many more architectures.

> +   lock, page faults can race against you even if you hold an mmap write lock.
> +
> +VMA Fields
> +----------
> +
> +We examine each field of the `struct vm_area_struct` type in detail in the table
> +below.
> +
> +Reading of each field requires either an mmap read lock or a VMA read lock to be
> +held, except where 'unstable RCU read' is specified, in which case unstable
> +access to the field is permitted under RCU alone.
> +
> +The table specifies which write locks must be held to write to the field.
> +
> +.. list-table::
> +   :widths: 20 10 22 5 20
> +   :header-rows: 1
> +
> +   * - Field
> +     - Config
> +     - Description
> +     - Unstable RCU read?
> +     - Write Lock
> +   * - vm_start
> +     -
> +     - Inclusive start virtual address of range VMA describes.
> +     -
> +     - mmap write, VMA write
> +   * - vm_end
> +     -
> +     - Exclusive end virtual address of range VMA describes.
> +     -
> +     - mmap write, VMA write
> +   * - vm_rcu
> +     - vma lock
> +     - RCU list head, in union with vma_start, vma_end. RCU implementation detail.
> +     - N/A
> +     - N/A
> +   * - vm_mm
> +     -
> +     - Containing mm_struct.
> +     - Y
> +     - (Static)
> +   * - vm_page_prot
> +     -
> +     - Architecture-specific page table protection bits determined from VMA
> +       flags
> +     -
> +     - mmap write, VMA write
> +   * - vm_flags
> +     -
> +     - Read-only access to VMA flags describing attributes of VMA, in union with
> +       private writable `__vm_flags`.
> +     -
> +     - N/A
> +   * - __vm_flags
> +     -
> +     - Private, writable access to VMA flags field, updated by vm_flags_*()
> +       functions.
> +     -
> +     - mmap write, VMA write
> +   * - detached
> +     - vma lock
> +     - VMA lock implementation detail - indicates whether the VMA has been
> +       detached from the tree.
> +     - Y
> +     - mmap write, VMA write
> +   * - vm_lock_seq
> +     - vma lock
> +     - VMA lock implementation detail - A sequence number used to serialise the
> +       VMA lock, see the VMA lock section below.
> +     - Y
> +     - mmap write, VMA write

It's a bit weird to state that VMA write-lock is required when talking
about vm_lock_seq/vm_lock themselves being parts of that lock. I would
simply say N/A for both of them since they should not be modified
directly.

> +   * - vm_lock
> +     - vma lock
> +     - VMA lock implementation detail - A pointer to the VMA lock read/write
> +       semaphore.
> +     - Y
> +     - None required
> +   * - shared.rb
> +     -
> +     - A red/black tree node used, if the mapping is file-backed, to place the
> +       VMA in the `struct address_space->i_mmap` red/black interval tree.
> +     -
> +     - mmap write, VMA write, i_mmap write
> +   * - shared.rb_subtree_last
> +     -
> +     - Metadata used for management of the interval tree if the VMA is
> +       file-backed.
> +     -
> +     - mmap write, VMA write, i_mmap write
> +   * - anon_vma_chain
> +     -
> +     - List of links to forked/CoW'd `anon_vma` objects.
> +     -
> +     - mmap read or above, anon_vma write lock

nit: I would spell it out for clarity: mmap read or write

> +   * - anon_vma
> +     -
> +     - `anon_vma` object used by anonymous folios mapped exclusively to this VMA.
> +     -
> +     - mmap read or above, page_table_lock
> +   * - vm_ops
> +     -
> +     - If the VMA is file-backed, then either the driver or file-system provides
> +       a `struct vm_operations_struct` object describing callbacks to be invoked
> +       on specific VMA lifetime events.
> +     -
> +     - (Static)
> +   * - vm_pgoff
> +     -
> +     - Describes the page offset into the file, the original page offset within
> +       the virtual address space (prior to any `mremap()`), or PFN if a PFN map.
> +     -
> +     - mmap write, VMA write
> +   * - vm_file
> +     -
> +     - If the VMA is file-backed, points to a `struct file` object describing
> +       the underlying file, if anonymous then `NULL`.
> +     -
> +     - (Static)
> +   * - vm_private_data
> +     -
> +     - A `void *` field for driver-specific metadata.
> +     -
> +     - Driver-mandated.
> +   * - anon_name
> +     - anon name
> +     - A field for storing a `struct anon_vma_name` object providing a name for
> +       anonymous mappings, or `NULL` if none is set or the VMA is file-backed.
> +     -
> +     - mmap write, VMA write
> +   * - swap_readahead_info
> +     - swap
> +     - Metadata used by the swap mechanism to perform readahead.
> +     -
> +     - mmap read
> +   * - vm_region
> +     - nommu
> +     - The containing region for the VMA for architectures which do not
> +       possess an MMU.
> +     - N/A
> +     - N/A
> +   * - vm_policy
> +     - numa
> +     - `mempolicy` object which describes NUMA behaviour of the VMA.
> +     -
> +     - mmap write, VMA write
> +   * - numab_state
> +     - numab
> +     - `vma_numab_state` object which describes the current state of NUMA
> +       balancing in relation to this VMA.
> +     -
> +     - mmap write, VMA write
> +   * - vm_userfaultfd_ctx
> +     -
> +     - Userfaultfd context wrapper object of type `vm_userfaultfd_ctx`, either
> +       of zero size if userfaultfd is disabled, or containing a pointer to an
> +       underlying `userfaultfd_ctx` object which describes userfaultfd metadata.
> +     -
> +     - mmap write, VMA write
> +
> +.. note::
> +
> +   In the config column 'vma lock' configuration means CONFIG_PER_VMA_LOCK,
> +   'anon name' means CONFIG_ANON_VMA_NAME, 'swap' means CONFIG_SWAP, 'nommu'
> +   means that CONFIG_MMU is not set, 'numa' means CONFIG_NUMA and 'numab' means
> +   CONFIG_NUMA_BALANCING'.
> +
> +   In the write lock column '(Static)' means that the field is set only once
> +   upon initialisation of the VMA and not changed after this, the VMA would
> +   either have been under an mmap write and VMA write lock at the time or not
> +   yet inserted into any tree.
> +
> +Page table locks
> +----------------
> +
> +When allocating a P4D, PUD or PMD and setting the relevant entry in the above
> +PGD, P4D or PUD, the `mm->page_table_lock` is acquired to do so. This is
> +acquired in `__p4d_alloc()`, `__pud_alloc()` and `__pmd_alloc()` respectively.
> +
> +.. note::
> +   `__pmd_alloc()` actually invokes `pud_lock()` and `pud_lockptr()` in turn,
> +   however at the time of writing it ultimately references the
> +   `mm->page_table_lock`.
> +
> +Allocating a PTE will either use the `mm->page_table_lock` or, if
> +`USE_SPLIT_PMD_PTLOCKS` is defined, used a lock embedded in the PMD physical
> +page metadata in the form of a `struct ptdesc`, acquired by `pmd_ptdesc()`
> +called from `pmd_lock()` and ultimately `__pte_alloc()`.
> +
> +Finally, modifying the contents of the PTE has special treatment, as this is a
> +lock that we must acquire whenever we want stable and exclusive access to
> +entries pointing to data pages within a PTE, especially when we wish to modify
> +them.
> +
> +This is performed via `pte_offset_map_lock()` which carefully checks to ensure
> +that the PTE hasn't changed from under us, ultimately invoking `pte_lockptr()`
> +to obtain a spin lock at PTE granularity contained within the `struct ptdesc`
> +associated with the physical PTE page. The lock must be released via
> +`pte_unmap_unlock()`.
> +
> +.. note::
> +   There are some variants on this, such as `pte_offset_map_rw_nolock()` when we
> +   know we hold the PTE stable but for brevity we do not explore this.
> +   See the comment for `__pte_offset_map_lock()` for more details.
> +
> +When modifying data in ranges we typically only wish to allocate higher page
> +tables as necessary, using these locks to avoid races or overwriting anything,
> +and set/clear data at the PTE level as required (for instance when page faulting
> +or zapping).
> +
> +Page table teardown
> +-------------------
> +
> +Tearing down page tables themselves is something that requires significant
> +care. There must be no way that page tables designated for removal can be
> +traversed or referenced by concurrent tasks.
> +
> +It is insufficient to simply hold an mmap write lock and VMA lock (which will
> +prevent racing faults, and rmap operations), as a file-backed mapping can be
> +truncated under the `struct address_space` i_mmap_lock alone.
> +
> +As a result, no VMA which can be accessed via the reverse mapping (either
> +anon_vma or the `struct address_space->i_mmap` interval tree) can have its page
> +tables torn down.
> +
> +The operation is typically performed via `free_pgtables()`, which assumes either
> +the mmap write lock has been taken (as specified by its `mm_wr_locked`
> +parameter), or that it the VMA is fully detached.
> +
> +It carefully removes the VMA from all reverse mappings, however it's important
> +that no new ones overlap these or any route remain to permit access to addresses
> +within the range whose page tables are being torn down.
> +
> +As a result of these careful conditions, note that page table entries are
> +cleared without page table locks, as it is assumed that all of these precautions
> +have already been taken.
> +
> +mmap write lock downgrading
> +---------------------------
> +
> +While it is possible to obtain an mmap write or read lock using the
> +`mm->mmap_lock` read/write semaphore, it is also possible to **downgrade** from
> +a write lock to a read lock via `mmap_write_downgrade()`.
> +
> +Similar to `mmap_write_unlock()`, this implicitly terminates all VMA write locks
> +via `vma_end_write_all()` (more or this behaviour in the VMA lock internals
> +section below), but importantly does not relinquish the mmap lock while
> +downgrading, therefore keeping the locked virtual address space stable.
> +
> +A subtlety here is that callers can assume, if they invoke an
> +mmap_write_downgrade() operation, that they still have exclusive access to the
> +virtual address space (excluding VMA read lock holders), as for another task to
> +have downgraded they would have had to have exclusive access to the semaphore
> +which can't be the case until the current task completes what it is doing.

I can't decipher the above paragraph. Could you please dumb it down
for the likes of me?

> +
> +Stack expansion
> +---------------
> +
> +Stack expansion throws up additional complexities in that we cannot permit there
> +to be racing page faults, as a result we invoke `vma_start_write()` to prevent
> +this in `expand_downwards()` or `expand_upwards()`.
> +
> +Lock ordering
> +-------------
> +
> +As we have multiple locks across the kernel which may or may not be taken at the
> +same time as explicit mm or VMA locks, we have to be wary of lock inversion, and
> +the **order** in which locks are acquired and released becomes very important.
> +
> +.. note::
> +
> +   Lock inversion occurs when two threads need to acquire multiple locks,
> +   but in doing so inadvertently cause a mutual deadlock.
> +
> +   For example, consider thread 1 which holds lock A and tries to acquire lock B,
> +   while thread 2 holds lock B and tries to acquire lock A.
> +
> +   Both threads are now deadlocked on each other. However, had they attempted to
> +   acquire locks in the same order, one would have waited for the other to
> +   complete its work and no deadlock would have occurred.
> +
> +The opening comment in `mm/rmap.c` describes in detail the required ordering of
> +locks within memory management code:
> +
> +.. code-block::
> +
> +  inode->i_rwsem       (while writing or truncating, not reading or faulting)
> +    mm->mmap_lock
> +      mapping->invalidate_lock (in filemap_fault)
> +        folio_lock
> +          hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
> +            vma_start_write
> +              mapping->i_mmap_rwsem
> +                anon_vma->rwsem
> +                  mm->page_table_lock or pte_lock
> +                    swap_lock (in swap_duplicate, swap_info_get)
> +                      mmlist_lock (in mmput, drain_mmlist and others)
> +                      mapping->private_lock (in block_dirty_folio)
> +                          i_pages lock (widely used)
> +                            lruvec->lru_lock (in folio_lruvec_lock_irq)
> +                      inode->i_lock (in set_page_dirty's __mark_inode_dirty)
> +                      bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
> +                        sb_lock (within inode_lock in fs/fs-writeback.c)
> +                        i_pages lock (widely used, in set_page_dirty,
> +                                  in arch-dependent flush_dcache_mmap_lock,
> +                                  within bdi.wb->list_lock in __sync_single_inode)
> +
> +Please check the current state of this comment which may have changed since the
> +time of writing of this document.
> +
> +VMA lock internals
> +------------------
> +
> +The VMA lock mechanism is designed to be a lightweight means of avoiding the use
> +of the heavily contended mmap lock. It is implemented using a combination of a
> +read/write semaphore and sequence numbers belonging to the containing `struct
> +mm_struct` and the VMA.
> +
> +Read locks are acquired via `vma_start_read()`, which is an optimistic
> +operation, i.e. it tries to acquire a read lock but returns false if it is
> +unable to do so. At the end of the read operation, `vma_end_read()` is called to
> +release the VMA read lock. This can be done under RCU alone.
> +
> +Writing requires the mmap to be write-locked and the VMA lock to be acquired via
> +`vma_start_write()`, however the write lock is released by the termination or
> +downgrade of the mmap write lock so no `vma_end_write()` is required.
> +
> +All this is achieved by the use of per-mm and per-VMA sequence counts. This is
> +used to reduce complexity, and potential especially around operations which

potential?

> +write-lock multiple VMAs at once.
> +
> +If the mm sequence count, `mm->mm_lock_seq` is equal to the VMA sequence count
> +`vma->vm_lock_seq` then the VMA is write-locked. If they differ, then they are
> +not.
> +
> +Each time an mmap write lock is acquired in `mmap_write_lock()`,
> +`mmap_write_lock_nested()`, `mmap_write_lock_killable()`, the `mm->mm_lock_seq`
> +sequence number is incremented via `mm_lock_seqcount_begin()`.
> +
> +Each time the mmap write lock is released in `mmap_write_unlock()` or
> +`mmap_write_downgrade()`, `vma_end_write_all()` is invoked which also increments
> +`mm->mm_lock_seq` via `mm_lock_seqcount_end()`.
> +
> +This way, we ensure regardless of the VMA's sequence number count, that a write
> +lock is not incorrectly indicated (since we increment the sequence counter on
> +acquiring the mmap write lock, which is required in order to obtain a VMA write
> +lock), and that when we release an mmap write lock, we efficiently release
> +**all** VMA write locks contained within the mmap at the same time.

Ok, I see that you describe some of the rules I mentioned before here.
Up to you where to place them.

> +
> +The exclusivity of the mmap write lock ensures this is what we want, as there
> +would never be a reason to persist per-VMA write locks across multiple mmap
> +write lock acquisitions.
> +
> +Each time a VMA read lock is acquired, we acquire a read lock on the
> +`vma->vm_lock` read/write semaphore and hold it, while checking that the
> +sequence count of the VMA does not match that of the mm.
> +
> +If it does, the read lock fails. If it does not, we hold the lock, excluding
> +writers, but permitting other readers, who will also obtain this lock under RCU.
> +
> +Importantly, maple tree operations performed in `lock_vma_under_rcu()` are also
> +RCU safe, so the whole read lock operation is guaranteed to function correctly.
> +
> +On the write side, we acquire a write lock on the `vma->vm_lock` read/write
> +semaphore, before setting the VMA's sequence number under this lock, also
> +simultaneously holding the mmap write lock.
> +
> +This way, if any read locks are in effect, `vma_start_write()` will sleep until
> +these are finished and mutual exclusion is achieved.
> +
> +After setting the VMA's sequence number, the lock is released, avoiding
> +complexity with a long-term held write lock.
> +
> +This clever combination of a read/write semaphore and sequence count allows for
> +fast RCU-based per-VMA lock acquisition (especially on x86-64 page fault, though
> +utilised elsewhere) with minimal complexity around lock ordering.
> --
> 2.47.0
>
Lorenzo Stoakes Nov. 4, 2024, 7:55 p.m. UTC | #14
On Mon, Nov 04, 2024 at 03:47:40PM +0200, Mike Rapoport wrote:
> On Mon, Nov 04, 2024 at 01:02:19PM +0000, Lorenzo Stoakes wrote:
> > On Fri, Nov 01, 2024 at 04:48:32PM -0700, SeongJae Park wrote:
> > >
> > > This is the "Unsorted Documentation" section.  If the document is really for
> > > the section, I'd suggest putting it in alphabetically sorted order, for the
> > > consistency.  However, if putting the document under the section is not your
> > > real intention, I think it might be better to be put under "Process Addresses"
> > > section above.  What do you think?
> >
> > Well, at the moment it's sort of a WIP thing that we may want to put under
> > another section, was just putting there somewhat arbitrarily for now.
> >
> > I also wanted to avoid too much debate about what to put where :P
> >
> > But absolutely, ack, will either sort it there or put it somewhere more
> > sensible, thanks!
>
> Don't mean to bikeshed, but it would make sense to put it to the "Process
> Address (space)" part :)

Ack will do :)

>
> --
> Sincerely yours,
> Mike.
Lorenzo Stoakes Nov. 4, 2024, 9:04 p.m. UTC | #15
On Mon, Nov 04, 2024 at 09:01:46AM -0800, Suren Baghdasaryan wrote:
> On Fri, Nov 1, 2024 at 11:51 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > Locking around VMAs is complicated and confusing. While we have a number of
> > disparate comments scattered around the place, we seem to be reaching a
> > level of complexity that justifies a serious effort at clearly documenting
> > how locks are expected to be interacted with when it comes to interacting
> > with mm_struct and vm_area_struct objects.
> >
> > This is especially pertinent as regards efforts to find sensible
> > abstractions for these fundamental objects within the kernel rust
> > abstraction whose compiler strictly requires some means of expressing these
> > rules (and through this expression can help self-document these
> > requirements as well as enforce them which is an exciting concept).
> >
> > The document limits scope to mmap and VMA locks and those that are
> > immediately adjacent and relevant to them - so additionally covers page
> > table locking as this is so very closely tied to VMA operations (and relies
> > upon us handling these correctly).
> >
> > The document tries to cover some of the nastier and more confusing edge
> > cases and concerns especially around lock ordering and page table teardown.
> >
> > The document also provides some VMA lock internals, which are up to date
> > and inclusive of recent changes to recent sequence number changes.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Thanks for documenting this, Lorenzo!

No worries, I feel it's very important to document this at this stage.

> Just heads-up, I'm working on changing some of the implementation
> details (removing vma->detached, moving vm_lock into vm_area_struct,
> etc.). I should be able to post those changes sometime later this week
> if testing does not reveal any issues.

Ack, yeah, we can update as we go. As for removing vma->detached, how are we
able to do this?

My understanding is that detached VMAs are ones that are being removed (due
to e.g. merge/MAP_FIXED mmap()/munmap()) and are due to be RCU-freed (as
vm_area_free() does this via call_rcu(), so freeing is delayed until the
grace period), but which have been VMA unlocked prior to the grace period, so
lock_vma_under_rcu() might grab one but shouldn't do anything with it and
should retry.

Will there be a new means of determining this?

Anyway... we can update as we go :)
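
(For the document, the caller pattern I want to spell out is roughly this - a
sketch modelled on the arch page fault handlers:)

/* Sketch of the usual lock_vma_under_rcu() caller pattern: try the
 * per-VMA read lock first, fall back to the mmap read lock if that
 * fails or the VMA could not be found. */
static void handle_fault_sketch(struct mm_struct *mm, unsigned long address)
{
        struct vm_area_struct *vma;

        vma = lock_vma_under_rcu(mm, address);
        if (vma) {
                /* ... handle the fault under the VMA read lock ... */
                vma_end_read(vma);
                return;
        }

        mmap_read_lock(mm);
        vma = vma_lookup(mm, address);
        if (vma) {
                /* ... handle the fault under the mmap read lock ... */
        }
        mmap_read_unlock(mm);
}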

>
> > ---
> >
> > REVIEWERS NOTES:
> >    You can speed up doc builds by running `make SPHINXDIRS=mm htmldocs`. I
> >    also uploaded a copy of this to my website at
> >    https://ljs.io/output/mm/vma_locks to make it easier to have a quick
> >    read through. Thanks!
> >
> >
> >  Documentation/mm/index.rst     |   1 +
> >  Documentation/mm/vma_locks.rst | 527 +++++++++++++++++++++++++++++++++
> >  2 files changed, 528 insertions(+)
> >  create mode 100644 Documentation/mm/vma_locks.rst
> >
> > diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
> > index 0be1c7503a01..da5f30acaca5 100644
> > --- a/Documentation/mm/index.rst
> > +++ b/Documentation/mm/index.rst
> > @@ -64,3 +64,4 @@ documentation, or deleted if it has served its purpose.
> >     vmemmap_dedup
> >     z3fold
> >     zsmalloc
> > +   vma_locks
> > diff --git a/Documentation/mm/vma_locks.rst b/Documentation/mm/vma_locks.rst
> > new file mode 100644
> > index 000000000000..52b9d484376a
> > --- /dev/null
> > +++ b/Documentation/mm/vma_locks.rst
> > @@ -0,0 +1,527 @@
> > +VMA Locking
> > +===========
> > +
> > +Overview
> > +--------
> > +
> > +Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
> > +'VMA's of type `struct vm_area_struct`.
> > +
> > +Each VMA describes a virtually contiguous memory range with identical
> > +attributes, each of which described by a `struct vm_area_struct`
> > +object. Userland access outside of VMAs is invalid except in the case where an
> > +adjacent stack VMA could be extended to contain the accessed address.
> > +
> > +All VMAs are contained within one and only one virtual address space, described
> > +by a `struct mm_struct` object which is referenced by all tasks (that is,
> > +threads) which share the virtual address space. We refer to this as the `mm`.
> > +
> > +Each mm object contains a maple tree data structure which describes all VMAs
> > +within the virtual address space.
> > +
> > +The kernel is designed to be highly scalable against concurrent access to
> > +userland memory, so a complicated set of locks are required to ensure no data
> > +races or memory corruption occurs.
> > +
> > +This document explores this locking in detail.
> > +
> > +.. note::
> > +
> > +   There are three different things that a user might want to achieve via
> > +   locks - the first of which is **stability**. That is - ensuring that the VMA
> > +   won't be freed or modified in any way from underneath us.
> > +
> > +   All MM and VMA locks ensure stability.
> > +
> > +   Secondly we have locks which allow **reads** but not writes (and which might
> > +   be held concurrent with other CPUs who also hold the read lock).
> > +
> > +   Finally, we have locks which permit exclusive access to the VMA to allow for
> > +   **writes** to the VMA.
> > +
> > +MM and VMA locks
> > +----------------
> > +
> > +There are two key classes of lock utilised when reading and manipulating VMAs -
> > +the `mmap_lock` which is a read/write semaphore maintained at the `mm_struct`
> > +level of granularity and, if CONFIG_PER_VMA_LOCK is set, a per-VMA lock at the
> > +VMA level of granularity.
> > +
> > +.. note::
> > +
> > +   Generally speaking, a read/write semaphore is a class of lock which permits
> > +   concurrent readers. However a write lock can only be obtained once all
> > +   readers have left the critical region (and pending readers made to wait).
> > +
> > +   This renders read locks on a read/write semaphore concurrent with other
> > +   readers and write locks exclusive against all others holding the semaphore.
> > +
> > +If CONFIG_PER_VMA_LOCK is not set, then things are relatively simple - a write
> > +mmap lock gives you exclusive write access to a VMA, and a read lock gives you
> > +concurrent read-only access.
> > +
> > +In the presence of CONFIG_PER_VMA_LOCK, i.e. VMA locks, things are more
> > +complicated. In this instance, a write semaphore is no longer enough to gain
> > +exclusive access to a VMA, a VMA write lock is also required.
>
> I think "exclusive access to a VMA" should be "exclusive access to mm"
> if you are talking about mmap_lock.

Right, but in the past an mm write lock was sufficient to gain exclusive
access to a _vma_. I will adjust to say 'write semaphore on the mm'.

>
> I think it's worth adding here:
> 1. to take a VMA write-lock you need to be holding an mmap_lock;
> 2. write-unlocking mmap_lock drops all VMA write locks in that mm.
>
> I see that you touch on this in the below "Note" section but that's
> not an implementation detail but the designed behavior, so I think
> these should not be mere side-notes.

Right, yeah, I do mention both of these, but perhaps it's worth explicitly
saying this right at the top. Will add.
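
Something along these lines near the top, perhaps (minimal sketch):

/* Minimal sketch of the write-side rules: a VMA write lock can only be
 * taken while holding the mmap write lock, and mmap_write_unlock()
 * implicitly drops every VMA write lock in the mm via
 * vma_end_write_all(). */
static void modify_vma_sketch(struct mm_struct *mm, unsigned long addr)
{
        struct vm_area_struct *vma;

        mmap_write_lock(mm);
        vma = vma_lookup(mm, addr);
        if (vma) {
                vma_start_write(vma);   /* excludes per-VMA readers */
                /* ... modify the VMA ... */
        }
        mmap_write_unlock(mm);          /* releases all VMA write locks */
}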

>
> > +
> > +The VMA lock is implemented via the use of both a read/write semaphore and
> > +per-VMA and per-mm sequence numbers. We go into detail on this in the VMA lock
> > +internals section below, so for the time being it is important only to note that
> > +we can obtain either a VMA read or write lock.
> > +
> > +.. note::
> > +
> > +   VMAs under VMA **read** lock are obtained by the `lock_vma_under_rcu()`
> > +   function, and **no** existing mmap or VMA lock must be held, This function
>
> "no existing mmap or VMA lock must be held" did you mean to say "no
> exclusive mmap or VMA locks must be held"? Because one can certainly
> hold a read-lock on them.

Hmm really? You can hold an mmap read lock and obtain a VMA read lock too
irrespective of that?

OK, my mistake - I will update this and the table below to reflect this,
thanks!

Also, I see that this part that was in a 'note' section is probably a bit
wordy, which somewhat takes away from the key messages - I will try to trim it
a bit or separate it out to make things clearer.

>
> > +   either returns a read-locked VMA, or NULL if the lock could not be
> > +   acquired. As the name suggests, the VMA will be acquired under RCU, though
> > +   once obtained, remains stable.
> > +
> > +   This kind of locking is entirely optimistic - if the lock is contended or a
> > +   competing write has started, then we do not obtain a read lock.
> > +
> > +   The `lock_vma_under_rcu()` function first calls `rcu_read_lock()` to ensure
> > +   that the VMA is acquired in an RCU critical section, then attempts to VMA
> > +   lock it via `vma_start_read()`, before releasing the RCU lock via
> > +   `rcu_read_unlock()`.
> > +
> > +   VMA read locks hold the a read lock on the `vma->vm_lock` semaphore for their
> > +   duration and the caller of `lock_vma_under_rcu()` must release it via
> > +   `vma_end_read()`.
> > +
> > +   VMA **write** locks are acquired via `vma_start_write()` in instances where a
> > +   VMA is about to be modified, unlike `vma_start_read()` the lock is always
> > +   acquired. An mmap write lock **must** be held for the duration of the VMA
> > +   write lock, releasing or downgrading the mmap write lock also releases the
> > +   VMA write lock so there is no `vma_end_write()` function.
> > +
> > +   Note that a semaphore write lock is not held across a VMA lock. Rather, a
> > +   sequence number is used for serialisation, and the write semaphore is only
> > +   acquired at the point of write lock to update this (we explore this in detail
> > +   in the VMA lock internals section below).
> > +
> > +   This ensures the semantics we require - VMA write locks provide exclusive
> > +   write access to the VMA.
> > +
> > +Examining all valid lock state and what each implies:
> > +
> > +.. list-table::
> > +   :header-rows: 1
> > +
> > +   * - mmap lock
> > +     - VMA lock
> > +     - Stable?
> > +     - Can read safely?
> > +     - Can write safely?
> > +   * - \-
> > +     - \-
> > +     - N
> > +     - N
> > +     - N
> > +   * - R
> > +     - \-
> > +     - Y
> > +     - Y
> > +     - N
> > +   * - \-
> > +     - R
> > +     - Y
> > +     - Y
> > +     - N
> > +   * - W
> > +     - \-
> > +     - Y
> > +     - Y
> > +     - N
> > +   * - W
> > +     - W
> > +     - Y
> > +     - Y
> > +     - Y
> > +
> > +Note that there are some exceptions to this - the `anon_vma` field is permitted
> > +to be written to under mmap read lock and is instead serialised by the `struct
> > +mm_struct` field `page_table_lock`. In addition, the `vm_mm` and all
> > +lock-specific fields are permitted to be read under RCU alone (though stability cannot
> > +be expected in this instance).
> > +
> > +.. note::
> > +   The most notable place to use the VMA read lock is on page table faults on
> > +   the x86-64 architecture, which importantly means that without a VMA write
>
> As Jann mentioned, CONFIG_PER_VMA_LOCK is supported on many more architectures.

Yes, have updated to say so already. Sorry I was behind on how much this
had progressed :>)

>
> > +   lock, page faults can race against you even if you hold an mmap write lock.
> > +
> > +VMA Fields
> > +----------
> > +
> > +We examine each field of the `struct vm_area_struct` type in detail in the table
> > +below.
> > +
> > +Reading of each field requires either an mmap read lock or a VMA read lock to be
> > +held, except where 'unstable RCU read' is specified, in which case unstable
> > +access to the field is permitted under RCU alone.
> > +
> > +The table specifies which write locks must be held to write to the field.
> > +
> > +.. list-table::
> > +   :widths: 20 10 22 5 20
> > +   :header-rows: 1
> > +
> > +   * - Field
> > +     - Config
> > +     - Description
> > +     - Unstable RCU read?
> > +     - Write Lock
> > +   * - vm_start
> > +     -
> > +     - Inclusive start virtual address of range VMA describes.
> > +     -
> > +     - mmap write, VMA write
> > +   * - vm_end
> > +     -
> > +     - Exclusive end virtual address of range VMA describes.
> > +     -
> > +     - mmap write, VMA write
> > +   * - vm_rcu
> > +     - vma lock
> > +     - RCU list head, in union with vma_start, vma_end. RCU implementation detail.
> > +     - N/A
> > +     - N/A
> > +   * - vm_mm
> > +     -
> > +     - Containing mm_struct.
> > +     - Y
> > +     - (Static)
> > +   * - vm_page_prot
> > +     -
> > +     - Architecture-specific page table protection bits determined from VMA
> > +       flags
> > +     -
> > +     - mmap write, VMA write
> > +   * - vm_flags
> > +     -
> > +     - Read-only access to VMA flags describing attributes of VMA, in union with
> > +       private writable `__vm_flags`.
> > +     -
> > +     - N/A
> > +   * - __vm_flags
> > +     -
> > +     - Private, writable access to VMA flags field, updated by vm_flags_*()
> > +       functions.
> > +     -
> > +     - mmap write, VMA write
> > +   * - detached
> > +     - vma lock
> > +     - VMA lock implementation detail - indicates whether the VMA has been
> > +       detached from the tree.
> > +     - Y
> > +     - mmap write, VMA write
> > +   * - vm_lock_seq
> > +     - vma lock
> > +     - VMA lock implementation detail - A sequence number used to serialise the
> > +       VMA lock, see the VMA lock section below.
> > +     - Y
> > +     - mmap write, VMA write
>
> It's a bit weird to state that VMA write-lock is required when talking
> about vm_lock_seq/vm_lock themselves being parts of that lock. I would
> simply say N/A for both of them since they should not be modified
> directly.

Ack will adjust.

>
> > +   * - vm_lock
> > +     - vma lock
> > +     - VMA lock implementation detail - A pointer to the VMA lock read/write
> > +       semaphore.
> > +     - Y
> > +     - None required
> > +   * - shared.rb
> > +     -
> > +     - A red/black tree node used, if the mapping is file-backed, to place the
> > +       VMA in the `struct address_space->i_mmap` red/black interval tree.
> > +     -
> > +     - mmap write, VMA write, i_mmap write
> > +   * - shared.rb_subtree_last
> > +     -
> > +     - Metadata used for management of the interval tree if the VMA is
> > +       file-backed.
> > +     -
> > +     - mmap write, VMA write, i_mmap write
> > +   * - anon_vma_chain
> > +     -
> > +     - List of links to forked/CoW'd `anon_vma` objects.
> > +     -
> > +     - mmap read or above, anon_vma write lock
>
> nit: I would spell it out for clarity: mmap read or write

Ack will fix

>
> > +   * - anon_vma
> > +     -
> > +     - `anon_vma` object used by anonymous folios mapped exclusively to this VMA.
> > +     -
> > +     - mmap read or above, page_table_lock
> > +   * - vm_ops
> > +     -
> > +     - If the VMA is file-backed, then either the driver or file-system provides
> > +       a `struct vm_operations_struct` object describing callbacks to be invoked
> > +       on specific VMA lifetime events.
> > +     -
> > +     - (Static)
> > +   * - vm_pgoff
> > +     -
> > +     - Describes the page offset into the file, the original page offset within
> > +       the virtual address space (prior to any `mremap()`), or PFN if a PFN map.
> > +     -
> > +     - mmap write, VMA write
> > +   * - vm_file
> > +     -
> > +     - If the VMA is file-backed, points to a `struct file` object describing
> > +       the underlying file, if anonymous then `NULL`.
> > +     -
> > +     - (Static)
> > +   * - vm_private_data
> > +     -
> > +     - A `void *` field for driver-specific metadata.
> > +     -
> > +     - Driver-mandated.
> > +   * - anon_name
> > +     - anon name
> > +     - A field for storing a `struct anon_vma_name` object providing a name for
> > +       anonymous mappings, or `NULL` if none is set or the VMA is file-backed.
> > +     -
> > +     - mmap write, VMA write
> > +   * - swap_readahead_info
> > +     - swap
> > +     - Metadata used by the swap mechanism to perform readahead.
> > +     -
> > +     - mmap read
> > +   * - vm_region
> > +     - nommu
> > +     - The containing region for the VMA for architectures which do not
> > +       possess an MMU.
> > +     - N/A
> > +     - N/A
> > +   * - vm_policy
> > +     - numa
> > +     - `mempolicy` object which describes NUMA behaviour of the VMA.
> > +     -
> > +     - mmap write, VMA write
> > +   * - numab_state
> > +     - numab
> > +     - `vma_numab_state` object which describes the current state of NUMA
> > +       balancing in relation to this VMA.
> > +     -
> > +     - mmap write, VMA write
> > +   * - vm_userfaultfd_ctx
> > +     -
> > +     - Userfaultfd context wrapper object of type `vm_userfaultfd_ctx`, either
> > +       of zero size if userfaultfd is disabled, or containing a pointer to an
> > +       underlying `userfaultfd_ctx` object which describes userfaultfd metadata.
> > +     -
> > +     - mmap write, VMA write
> > +
> > +.. note::
> > +
> > +   In the config column 'vma lock' configuration means CONFIG_PER_VMA_LOCK,
> > +   'anon name' means CONFIG_ANON_VMA_NAME, 'swap' means CONFIG_SWAP, 'nommu'
> > +   means that CONFIG_MMU is not set, 'numa' means CONFIG_NUMA and 'numab' means
> > +   CONFIG_NUMA_BALANCING'.
> > +
> > +   In the write lock column '(Static)' means that the field is set only once
> > +   upon initialisation of the VMA and not changed after this, the VMA would
> > +   either have been under an mmap write and VMA write lock at the time or not
> > +   yet inserted into any tree.
> > +
> > +Page table locks
> > +----------------
> > +
> > +When allocating a P4D, PUD or PMD and setting the relevant entry in the above
> > +PGD, P4D or PUD, the `mm->page_table_lock` is acquired to do so. This is
> > +acquired in `__p4d_alloc()`, `__pud_alloc()` and `__pmd_alloc()` respectively.
> > +
> > +.. note::
> > +   `__pmd_alloc()` actually invokes `pud_lock()` and `pud_lockptr()` in turn,
> > +   however at the time of writing it ultimately references the
> > +   `mm->page_table_lock`.
> > +
> > +Allocating a PTE will either use the `mm->page_table_lock` or, if
> > +`USE_SPLIT_PMD_PTLOCKS` is defined, use a lock embedded in the PMD physical
> > +page metadata in the form of a `struct ptdesc`, acquired by `pmd_ptdesc()`
> > +called from `pmd_lock()` and ultimately `__pte_alloc()`.
> > +
> > +Finally, modifying the contents of the PTE has special treatment, as this is a
> > +lock that we must acquire whenever we want stable and exclusive access to
> > +entries pointing to data pages within a PTE, especially when we wish to modify
> > +them.
> > +
> > +This is performed via `pte_offset_map_lock()` which carefully checks to ensure
> > +that the PTE hasn't changed from under us, ultimately invoking `pte_lockptr()`
> > +to obtain a spin lock at PTE granularity contained within the `struct ptdesc`
> > +associated with the physical PTE page. The lock must be released via
> > +`pte_unmap_unlock()`.
> > +
> > +.. note::
> > +   There are some variants on this, such as `pte_offset_map_rw_nolock()` when we
> > +   know we hold the PTE stable but for brevity we do not explore this.
> > +   See the comment for `__pte_offset_map_lock()` for more details.
> > +
> > +When modifying data in ranges we typically only wish to allocate higher page
> > +tables as necessary, using these locks to avoid races or overwriting anything,
> > +and set/clear data at the PTE level as required (for instance when page faulting
> > +or zapping).
> > +
> > +Page table teardown
> > +-------------------
> > +
> > +Tearing down page tables themselves is something that requires significant
> > +care. There must be no way that page tables designated for removal can be
> > +traversed or referenced by concurrent tasks.
> > +
> > +It is insufficient to simply hold an mmap write lock and VMA lock (which will
> > +prevent racing faults, and rmap operations), as a file-backed mapping can be
> > +truncated under the `struct address_space` i_mmap_lock alone.
> > +
> > +As a result, no VMA which can be accessed via the reverse mapping (either
> > +anon_vma or the `struct address_space->i_mmap` interval tree) can have its page
> > +tables torn down.
> > +
> > +The operation is typically performed via `free_pgtables()`, which assumes either
> > +the mmap write lock has been taken (as specified by its `mm_wr_locked`
> > +parameter), or that the VMA is fully detached.
> > +
> > +It carefully removes the VMA from all reverse mappings, however it's important
> > +that no new ones overlap these and that no route remains to permit access to
> > +addresses within the range whose page tables are being torn down.
> > +
> > +As a result of these careful conditions, note that page table entries are
> > +cleared without page table locks, as it is assumed that all of these precautions
> > +have already been taken.
> > +
> > +mmap write lock downgrading
> > +---------------------------
> > +
> > +While it is possible to obtain an mmap write or read lock using the
> > +`mm->mmap_lock` read/write semaphore, it is also possible to **downgrade** from
> > +a write lock to a read lock via `mmap_write_downgrade()`.
> > +
> > +Similar to `mmap_write_unlock()`, this implicitly terminates all VMA write locks
> > +via `vma_end_write_all()` (more on this behaviour in the VMA lock internals
> > +section below), but importantly does not relinquish the mmap lock while
> > +downgrading, therefore keeping the locked virtual address space stable.
> > +
> > +A subtlety here is that callers can assume, if they invoke an
> > +mmap_write_downgrade() operation, that they still have exclusive access to the
> > +virtual address space (excluding VMA read lock holders), as for another task to
> > +have downgraded they would have had to have exclusive access to the semaphore
> > +which can't be the case until the current task completes what it is doing.
>
> I can't decipher the above paragraph. Could you please dumb it down
> for the likes of me?

Since you're smarter than me this indicates I am not being clear here :)
Actually reading this again I've not expressed this correctly.

This is something Jann mentioned, that I hadn't thought of before.

So if you have an mmap write lock, you have exclusive access to the mmap
(with the usual caveats about racing vma locks unless you vma write lock).

When you downgrade you now have a read lock - but because you were
exclusive earlier in the function AND any new caller of the function will
have to acquire that same write lock FIRST, they all have to wait on you
and therefore you have exclusive access to the mmap while holding only a read lock.

So you are actually guaranteed that nobody else can be racing you _in that
function_, and equally no other writers can arise until you're done as your
holding the read lock prevents that.

Jann - correct me if I'm wrong or missing something here.

Will correct this unless Jann tells me I'm missing something on this :)
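
For the doc I'll probably illustrate it with something like this (a simplified
sketch, not taken from an actual caller):

    static void downgrade_sketch(struct mm_struct *mm)
    {
            mmap_write_lock(mm);
            /* Exclusive - no other readers or writers of this mm. */

            /* ... modifications requiring the write lock ... */

            mmap_write_downgrade(mm);
            /*
             * Only a read lock is now held, but any would-be writer must
             * first take the write lock, which it cannot do until we drop
             * this read lock - so no new writer can race us here. Plain
             * readers and VMA read lock holders can still run concurrently.
             */

            /* ... read-only work ... */

            mmap_read_unlock(mm);
    }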

>
> > +
> > +Stack expansion
> > +---------------
> > +
> > +Stack expansion throws up additional complexities in that we cannot permit there
> > +to be racing page faults, as a result we invoke `vma_start_write()` to prevent
> > +this in `expand_downwards()` or `expand_upwards()`.
> > +
> > +Lock ordering
> > +-------------
> > +
> > +As we have multiple locks across the kernel which may or may not be taken at the
> > +same time as explicit mm or VMA locks, we have to be wary of lock inversion, and
> > +the **order** in which locks are acquired and released becomes very important.
> > +
> > +.. note::
> > +
> > +   Lock inversion occurs when two threads need to acquire multiple locks,
> > +   but in doing so inadvertently cause a mutual deadlock.
> > +
> > +   For example, consider thread 1 which holds lock A and tries to acquire lock B,
> > +   while thread 2 holds lock B and tries to acquire lock A.
> > +
> > +   Both threads are now deadlocked on each other. However, had they attempted to
> > +   acquire locks in the same order, one would have waited for the other to
> > +   complete its work and no deadlock would have occurred.
> > +
> > +The opening comment in `mm/rmap.c` describes in detail the required ordering of
> > +locks within memory management code:
> > +
> > +.. code-block::
> > +
> > +  inode->i_rwsem       (while writing or truncating, not reading or faulting)
> > +    mm->mmap_lock
> > +      mapping->invalidate_lock (in filemap_fault)
> > +        folio_lock
> > +          hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
> > +            vma_start_write
> > +              mapping->i_mmap_rwsem
> > +                anon_vma->rwsem
> > +                  mm->page_table_lock or pte_lock
> > +                    swap_lock (in swap_duplicate, swap_info_get)
> > +                      mmlist_lock (in mmput, drain_mmlist and others)
> > +                      mapping->private_lock (in block_dirty_folio)
> > +                          i_pages lock (widely used)
> > +                            lruvec->lru_lock (in folio_lruvec_lock_irq)
> > +                      inode->i_lock (in set_page_dirty's __mark_inode_dirty)
> > +                      bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
> > +                        sb_lock (within inode_lock in fs/fs-writeback.c)
> > +                        i_pages lock (widely used, in set_page_dirty,
> > +                                  in arch-dependent flush_dcache_mmap_lock,
> > +                                  within bdi.wb->list_lock in __sync_single_inode)
> > +
> > +Please check the current state of this comment which may have changed since the
> > +time of writing of this document.
> > +
> > +VMA lock internals
> > +------------------
> > +
> > +The VMA lock mechanism is designed to be a lightweight means of avoiding the use
> > +of the heavily contended mmap lock. It is implemented using a combination of a
> > +read/write semaphore and sequence numbers belonging to the containing `struct
> > +mm_struct` and the VMA.
> > +
> > +Read locks are acquired via `vma_start_read()`, which is an optimistic
> > +operation, i.e. it tries to acquire a read lock but returns false if it is
> > +unable to do so. At the end of the read operation, `vma_end_read()` is called to
> > +release the VMA read lock. This can be done under RCU alone.
> > +
> > +Writing requires the mmap to be write-locked and the VMA lock to be acquired via
> > +`vma_start_write()`, however the write lock is released by the termination or
> > +downgrade of the mmap write lock so no `vma_end_write()` is required.
> > +
> > +All this is achieved by the use of per-mm and per-VMA sequence counts. This is
> > +used to reduce complexity, and potential especially around operations which
>
> potential?

Yeah sorry this sentence is completely mangled, will fix!

>
> > +write-lock multiple VMAs at once.
> > +
> > +If the mm sequence count, `mm->mm_lock_seq`, is equal to the VMA sequence count,
> > +`vma->vm_lock_seq`, then the VMA is write-locked. If they differ, then it is
> > +not.
> > +
> > +Each time an mmap write lock is acquired in `mmap_write_lock()`,
> > +`mmap_write_lock_nested()`, `mmap_write_lock_killable()`, the `mm->mm_lock_seq`
> > +sequence number is incremented via `mm_lock_seqcount_begin()`.
> > +
> > +Each time the mmap write lock is released in `mmap_write_unlock()` or
> > +`mmap_write_downgrade()`, `vma_end_write_all()` is invoked which also increments
> > +`mm->mm_lock_seq` via `mm_lock_seqcount_end()`.
> > +
> > +This way, we ensure, regardless of the VMA's sequence number, that a write
> > +lock is not incorrectly indicated (since we increment the sequence counter on
> > +acquiring the mmap write lock, which is required in order to obtain a VMA write
> > +lock), and that when we release an mmap write lock, we efficiently release
> > +**all** VMA write locks contained within the mmap at the same time.
>
> Ok, I see that you describe some of the rules I mentioned before here.
> Up to you where to place them.

Yeah may rearrange a little in general to clear things up a bit.

I wanted a bit on the internals here, but then I end up mentioning so much
of this above that maybe it's a bit duplicative... let's see how I do on
the respin :)
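
For the internals section I might boil the read-lock fast path down to a
sketch along these lines (hand-wavy: it ignores memory ordering details and
the exact sequence count types, so it is not the real vma_start_read()):

    static bool vma_read_trylock_sketch(struct vm_area_struct *vma)
    {
            /* Sequence counts equal => the VMA is write-locked, bail out. */
            if (vma->vm_lock_seq == vma->vm_mm->mm_lock_seq)
                    return false;

            if (!down_read_trylock(&vma->vm_lock->lock))
                    return false;

            /* Re-check: a writer may have write-locked the VMA meanwhile. */
            if (vma->vm_lock_seq == vma->vm_mm->mm_lock_seq) {
                    up_read(&vma->vm_lock->lock);
                    return false;
            }

            return true;
    }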

>
> > +
> > +The exclusivity of the mmap write lock ensures this is what we want, as there
> > +would never be a reason to persist per-VMA write locks across multiple mmap
> > +write lock acquisitions.
> > +
> > +Each time a VMA read lock is acquired, we acquire a read lock on the
> > +`vma->vm_lock` read/write semaphore and hold it, while checking that the
> > +sequence count of the VMA does not match that of the mm.
> > +
> > +If it does, the read lock fails. If it does not, we hold the lock, excluding
> > +writers, but permitting other readers, who will also obtain this lock under RCU.
> > +
> > +Importantly, maple tree operations performed in `lock_vma_under_rcu()` are also
> > +RCU safe, so the whole read lock operation is guaranteed to function correctly.
> > +
> > +On the write side, we acquire a write lock on the `vma->vm_lock` read/write
> > +semaphore, before setting the VMA's sequence number under this lock, also
> > +simultaneously holding the mmap write lock.
> > +
> > +This way, if any read locks are in effect, `vma_start_write()` will sleep until
> > +these are finished and mutual exclusion is achieved.
> > +
> > +After setting the VMA's sequence number, the lock is released, avoiding
> > +complexity with a long-term held write lock.
> > +
> > +This clever combination of a read/write semaphore and sequence count allows for
> > +fast RCU-based per-VMA lock acquisition (especially on x86-64 page fault, though
> > +utilised elsewhere) with minimal complexity around lock ordering.
> > --
> > 2.47.0
> >

Thanks for the review! Will try to clarify and tighten things up generally
on top of the points you and the other reviewers have raised here and
hopefully v2 should be nice and sharp!
Suren Baghdasaryan Nov. 4, 2024, 9:20 p.m. UTC | #16
On Mon, Nov 4, 2024 at 1:04 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Nov 04, 2024 at 09:01:46AM -0800, Suren Baghdasaryan wrote:
> > On Fri, Nov 1, 2024 at 11:51 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > Locking around VMAs is complicated and confusing. While we have a number of
> > > disparate comments scattered around the place, we seem to be reaching a
> > > level of complexity that justifies a serious effort at clearly documenting
> > > how locks are expected to be interacted with when it comes to interacting
> > > with mm_struct and vm_area_struct objects.
> > >
> > > This is especially pertinent as regards efforts to find sensible
> > > abstractions for these fundamental objects within the kernel rust
> > > abstraction whose compiler strictly requires some means of expressing these
> > > rules (and through this expression can help self-document these
> > > requirements as well as enforce them which is an exciting concept).
> > >
> > > The document limits scope to mmap and VMA locks and those that are
> > > immediately adjacent and relevant to them - so additionally covers page
> > > table locking as this is so very closely tied to VMA operations (and relies
> > > upon us handling these correctly).
> > >
> > > The document tries to cover some of the nastier and more confusing edge
> > > cases and concerns especially around lock ordering and page table teardown.
> > >
> > > The document also provides some VMA lock internals, which are up to date
> > > and inclusive of recent changes to recent sequence number changes.
> > >
> > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >
> > Thanks for documenting this, Lorenzo!
>
> No worries, I feel it's very important to document this at this stage.
>
> > Just heads-up, I'm working on changing some of the implementation
> > details (removing vma->detached, moving vm_lock into vm_area_struct,
> > etc.). I should be able to post those changes sometime later this week
> > if testing does not reveal any issues.
>
> Ack, yeah, we can update as we go. As for removing vma->detached, how are we
> able to do this?
>
> My understanding is that detached VMAs are ones that are being removed (due
> to e.g. merge/MAP_FIXED mmap()/munmap()) and are due to be RCU freed (as
> vm_area_free() does this via call_rcu(), so freeing is delayed until a grace
> period), but which have been VMA unlocked prior to the grace period, so
> lock_vma_under_rcu() might grab them but shouldn't do anything with them and
> should retry.
>
> Will there be a new means of determining this?
>
> Anyway... we can update as we go :)
>
> >
> > > ---
> > >
> > > REVIEWERS NOTES:
> > >    You can speed up doc builds by running `make SPHINXDIRS=mm htmldocs`. I
> > >    also uploaded a copy of this to my website at
> > >    https://ljs.io/output/mm/vma_locks to make it easier to have a quick
> > >    read through. Thanks!
> > >
> > >
> > >  Documentation/mm/index.rst     |   1 +
> > >  Documentation/mm/vma_locks.rst | 527 +++++++++++++++++++++++++++++++++
> > >  2 files changed, 528 insertions(+)
> > >  create mode 100644 Documentation/mm/vma_locks.rst
> > >
> > > diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
> > > index 0be1c7503a01..da5f30acaca5 100644
> > > --- a/Documentation/mm/index.rst
> > > +++ b/Documentation/mm/index.rst
> > > @@ -64,3 +64,4 @@ documentation, or deleted if it has served its purpose.
> > >     vmemmap_dedup
> > >     z3fold
> > >     zsmalloc
> > > +   vma_locks
> > > diff --git a/Documentation/mm/vma_locks.rst b/Documentation/mm/vma_locks.rst
> > > new file mode 100644
> > > index 000000000000..52b9d484376a
> > > --- /dev/null
> > > +++ b/Documentation/mm/vma_locks.rst
> > > @@ -0,0 +1,527 @@
> > > +VMA Locking
> > > +===========
> > > +
> > > +Overview
> > > +--------
> > > +
> > > +Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
> > > +'VMA's of type `struct vm_area_struct`.
> > > +
> > > +Each VMA describes a virtually contiguous memory range with identical
> > > +attributes, each of which described by a `struct vm_area_struct`
> > > +object. Userland access outside of VMAs is invalid except in the case where an
> > > +adjacent stack VMA could be extended to contain the accessed address.
> > > +
> > > +All VMAs are contained within one and only one virtual address space, described
> > > +by a `struct mm_struct` object which is referenced by all tasks (that is,
> > > +threads) which share the virtual address space. We refer to this as the `mm`.
> > > +
> > > +Each mm object contains a maple tree data structure which describes all VMAs
> > > +within the virtual address space.
> > > +
> > > +The kernel is designed to be highly scalable against concurrent access to
> > > +userland memory, so a complicated set of locks are required to ensure no data
> > > +races or memory corruption occurs.
> > > +
> > > +This document explores this locking in detail.
> > > +
> > > +.. note::
> > > +
> > > +   There are three different things that a user might want to achieve via
> > > +   locks - the first of which is **stability**. That is - ensuring that the VMA
> > > +   won't be freed or modified in any way from underneath us.
> > > +
> > > +   All MM and VMA locks ensure stability.
> > > +
> > > +   Secondly we have locks which allow **reads** but not writes (and which might
> > > +   be held concurrent with other CPUs who also hold the read lock).
> > > +
> > > +   Finally, we have locks which permit exclusive access to the VMA to allow for
> > > +   **writes** to the VMA.
> > > +
> > > +MM and VMA locks
> > > +----------------
> > > +
> > > +There are two key classes of lock utilised when reading and manipulating VMAs -
> > > +the `mmap_lock` which is a read/write semaphore maintained at the `mm_struct`
> > > +level of granularity and, if CONFIG_PER_VMA_LOCK is set, a per-VMA lock at the
> > > +VMA level of granularity.
> > > +
> > > +.. note::
> > > +
> > > +   Generally speaking, a read/write semaphore is a class of lock which permits
> > > +   concurrent readers. However a write lock can only be obtained once all
> > > +   readers have left the critical region (and pending readers made to wait).
> > > +
> > > +   This renders read locks on a read/write semaphore concurrent with other
> > > +   readers and write locks exclusive against all others holding the semaphore.
> > > +
> > > +If CONFIG_PER_VMA_LOCK is not set, then things are relatively simple - a write
> > > +mmap lock gives you exclusive write access to a VMA, and a read lock gives you
> > > +concurrent read-only access.
> > > +
> > > +In the presence of CONFIG_PER_VMA_LOCK, i.e. VMA locks, things are more
> > > +complicated. In this instance, a write semaphore is no longer enough to gain
> > > +exclusive access to a VMA, a VMA write lock is also required.
> >
> > I think "exclusive access to a VMA" should be "exclusive access to mm"
> > if you are talking about mmap_lock.
>
> Right, but in the past an mm write lock was sufficient to gain exclusive
> access to a _vma_. I will adjust to say 'write semaphore on the mm'.
>
> >
> > I think it's worth adding here:
> > 1. to take a VMA write-lock you need to be holding an mmap_lock;
> > 2. write-unlocking mmap_lock drops all VMA write locks in that mm.
> >
> > I see that you touch on this in the below "Note" section but that's
> > not an implementation detail but the designed behavior, so I think
> > these should not be mere side-notes.
>
> Right yeah I do mention both of these, but perhaps it's worth explicitly
> saying this right at the top. Will add.
>
> >
> > > +
> > > +The VMA lock is implemented via the use of both a read/write semaphore and
> > > +per-VMA and per-mm sequence numbers. We go into detail on this in the VMA lock
> > > +internals section below, so for the time being it is important only to note that
> > > +we can obtain either a VMA read or write lock.
> > > +
> > > +.. note::
> > > +
> > > +   VMAs under VMA **read** lock are obtained by the `lock_vma_under_rcu()`
> > > +   function, and **no** existing mmap or VMA lock must be held. This function
> >
> > "no existing mmap or VMA lock must be held" did you mean to say "no
> > exclusive mmap or VMA locks must be held"? Because one can certainly
> > hold a read-lock on them.
>
> Hmm really? You can hold an mmap read lock and obtain a VMA read lock too
> irrespective of that?
>
> OK, my mistake will update this and the table below to reflect this,
> thanks!
>
> Also I see that this part that was in a 'note' section is probably a bit
> wordy, which somewhat takes away from the key messages, will try to trim a
> bit or separate out to make things clearer.
>
> >
> > > +   either returns a read-locked VMA, or NULL if the lock could not be
> > > +   acquired. As the name suggests, the VMA will be acquired under RCU, though
> > > +   once obtained, it remains stable.
> > > +
> > > +   This kind of locking is entirely optimistic - if the lock is contended or a
> > > +   competing write has started, then we do not obtain a read lock.
> > > +
> > > +   The `lock_vma_under_rcu()` function first calls `rcu_read_lock()` to ensure
> > > +   that the VMA is acquired in an RCU critical section, then attempts to VMA
> > > +   lock it via `vma_start_read()`, before releasing the RCU lock via
> > > +   `rcu_read_unlock()`.
> > > +
> > > +   VMA read locks hold a read lock on the `vma->vm_lock` semaphore for their
> > > +   duration and the caller of `lock_vma_under_rcu()` must release it via
> > > +   `vma_end_read()`.
> > > +
> > > +   VMA **write** locks are acquired via `vma_start_write()` in instances where a
> > > +   VMA is about to be modified, unlike `vma_start_read()` the lock is always
> > > +   acquired. An mmap write lock **must** be held for the duration of the VMA
> > > +   write lock, releasing or downgrading the mmap write lock also releases the
> > > +   VMA write lock so there is no `vma_end_write()` function.
> > > +
> > > +   Note that a semaphore write lock is not held across a VMA lock. Rather, a
> > > +   sequence number is used for serialisation, and the write semaphore is only
> > > +   acquired at the point of write lock to update this (we explore this in detail
> > > +   in the VMA lock internals section below).
> > > +
> > > +   This ensures the semantics we require - VMA write locks provide exclusive
> > > +   write access to the VMA.
> > > +
> > > +Examining all valid lock state and what each implies:
> > > +
> > > +.. list-table::
> > > +   :header-rows: 1
> > > +
> > > +   * - mmap lock
> > > +     - VMA lock
> > > +     - Stable?
> > > +     - Can read safely?
> > > +     - Can write safely?
> > > +   * - \-
> > > +     - \-
> > > +     - N
> > > +     - N
> > > +     - N
> > > +   * - R
> > > +     - \-
> > > +     - Y
> > > +     - Y
> > > +     - N
> > > +   * - \-
> > > +     - R
> > > +     - Y
> > > +     - Y
> > > +     - N
> > > +   * - W
> > > +     - \-
> > > +     - Y
> > > +     - Y
> > > +     - N
> > > +   * - W
> > > +     - W
> > > +     - Y
> > > +     - Y
> > > +     - Y
> > > +
> > > +Note that there are some exceptions to this - the `anon_vma` field is permitted
> > > +to be written to under mmap read lock and is instead serialised by the `struct
> > > +mm_struct` field `page_table_lock`. In addition the `vm_mm` and all
> > > +lock-specific fields are permitted to be read under RCU alone  (though stability cannot
> > > +be expected in this instance).
> > > +
> > > +.. note::
> > > +   The most notable place to use the VMA read lock is on page table faults on
> > > +   the x86-64 architecture, which importantly means that without a VMA write
> >
> > As Jann mentioned, CONFIG_PER_VMA_LOCK is supported on many more architectures.
>
> Yes, have updated to say so already. Sorry I was behind on how much this
> had progressed :>)
>
> >
> > > +   lock, page faults can race against you even if you hold an mmap write lock.
> > > +
> > > +VMA Fields
> > > +----------
> > > +
> > > +We examine each field of the `struct vm_area_struct` type in detail in the table
> > > +below.
> > > +
> > > +Reading of each field requires either an mmap read lock or a VMA read lock to be
> > > +held, except where 'unstable RCU read' is specified, in which case unstable
> > > +access to the field is permitted under RCU alone.
> > > +
> > > +The table specifies which write locks must be held to write to the field.
> > > +
> > > +.. list-table::
> > > +   :widths: 20 10 22 5 20
> > > +   :header-rows: 1
> > > +
> > > +   * - Field
> > > +     - Config
> > > +     - Description
> > > +     - Unstable RCU read?
> > > +     - Write Lock
> > > +   * - vm_start
> > > +     -
> > > +     - Inclusive start virtual address of range VMA describes.
> > > +     -
> > > +     - mmap write, VMA write
> > > +   * - vm_end
> > > +     -
> > > +     - Exclusive end virtual address of range VMA describes.
> > > +     -
> > > +     - mmap write, VMA write
> > > +   * - vm_rcu
> > > +     - vma lock
> > > +     - RCU list head, in union with vma_start, vma_end. RCU implementation detail.
> > > +     - N/A
> > > +     - N/A
> > > +   * - vm_mm
> > > +     -
> > > +     - Containing mm_struct.
> > > +     - Y
> > > +     - (Static)
> > > +   * - vm_page_prot
> > > +     -
> > > +     - Architecture-specific page table protection bits determined from VMA
> > > +       flags
> > > +     -
> > > +     - mmap write, VMA write
> > > +   * - vm_flags
> > > +     -
> > > +     - Read-only access to VMA flags describing attributes of VMA, in union with
> > > +       private writable `__vm_flags`.
> > > +     -
> > > +     - N/A
> > > +   * - __vm_flags
> > > +     -
> > > +     - Private, writable access to VMA flags field, updated by vm_flags_*()
> > > +       functions.
> > > +     -
> > > +     - mmap write, VMA write
> > > +   * - detached
> > > +     - vma lock
> > > +     - VMA lock implementation detail - indicates whether the VMA has been
> > > +       detached from the tree.
> > > +     - Y
> > > +     - mmap write, VMA write
> > > +   * - vm_lock_seq
> > > +     - vma lock
> > > +     - VMA lock implementation detail - A sequence number used to serialise the
> > > +       VMA lock, see the VMA lock section below.
> > > +     - Y
> > > +     - mmap write, VMA write
> >
> > It's a bit weird to state that VMA write-lock is required when talking
> > about vm_lock_seq/vm_lock themselves being parts of that lock. I would
> > simply say N/A for both of them since they should not be modified
> > directly.
>
> Ack will adjust.
>
> >
> > > +   * - vm_lock
> > > +     - vma lock
> > > +     - VMA lock implementation detail - A pointer to the VMA lock read/write
> > > +       semaphore.
> > > +     - Y
> > > +     - None required
> > > +   * - shared.rb
> > > +     -
> > > +     - A red/black tree node used, if the mapping is file-backed, to place the
> > > +       VMA in the `struct address_space->i_mmap` red/black interval tree.
> > > +     -
> > > +     - mmap write, VMA write, i_mmap write
> > > +   * - shared.rb_subtree_last
> > > +     -
> > > +     - Metadata used for management of the interval tree if the VMA is
> > > +       file-backed.
> > > +     -
> > > +     - mmap write, VMA write, i_mmap write
> > > +   * - anon_vma_chain
> > > +     -
> > > +     - List of links to forked/CoW'd `anon_vma` objects.
> > > +     -
> > > +     - mmap read or above, anon_vma write lock
> >
> > nit: I would spell it out for clarity: mmap read or write
>
> Ack will fix
>
> >
> > > +   * - anon_vma
> > > +     -
> > > +     - `anon_vma` object used by anonymous folios mapped exclusively to this VMA.
> > > +     -
> > > +     - mmap read or above, page_table_lock
> > > +   * - vm_ops
> > > +     -
> > > +     - If the VMA is file-backed, then either the driver or file-system provides
> > > +       a `struct vm_operations_struct` object describing callbacks to be invoked
> > > +       on specific VMA lifetime events.
> > > +     -
> > > +     - (Static)
> > > +   * - vm_pgoff
> > > +     -
> > > +     - Describes the page offset into the file, the original page offset within
> > > +       the virtual address space (prior to any `mremap()`), or PFN if a PFN map.
> > > +     -
> > > +     - mmap write, VMA write
> > > +   * - vm_file
> > > +     -
> > > +     - If the VMA is file-backed, points to a `struct file` object describing
> > > +       the underlying file, if anonymous then `NULL`.
> > > +     -
> > > +     - (Static)
> > > +   * - vm_private_data
> > > +     -
> > > +     - A `void *` field for driver-specific metadata.
> > > +     -
> > > +     - Driver-mandated.
> > > +   * - anon_name
> > > +     - anon name
> > > +     - A field for storing a `struct anon_vma_name` object providing a name for
> > > +       anonymous mappings, or `NULL` if none is set or the VMA is file-backed.
> > > +     -
> > > +     - mmap write, VMA write
> > > +   * - swap_readahead_info
> > > +     - swap
> > > +     - Metadata used by the swap mechanism to perform readahead.
> > > +     -
> > > +     - mmap read
> > > +   * - vm_region
> > > +     - nommu
> > > +     - The containing region for the VMA for architectures which do not
> > > +       possess an MMU.
> > > +     - N/A
> > > +     - N/A
> > > +   * - vm_policy
> > > +     - numa
> > > +     - `mempolicy` object which describes NUMA behaviour of the VMA.
> > > +     -
> > > +     - mmap write, VMA write
> > > +   * - numab_state
> > > +     - numab
> > > +     - `vma_numab_state` object which describes the current state of NUMA
> > > +       balancing in relation to this VMA.
> > > +     -
> > > +     - mmap write, VMA write
> > > +   * - vm_userfaultfd_ctx
> > > +     -
> > > +     - Userfaultfd context wrapper object of type `vm_userfaultfd_ctx`, either
> > > +       of zero size if userfaultfd is disabled, or containing a pointer to an
> > > +       underlying `userfaultfd_ctx` object which describes userfaultfd metadata.
> > > +     -
> > > +     - mmap write, VMA write
> > > +
> > > +.. note::
> > > +
> > > +   In the config column 'vma lock' configuration means CONFIG_PER_VMA_LOCK,
> > > +   'anon name' means CONFIG_ANON_VMA_NAME, 'swap' means CONFIG_SWAP, 'nommu'
> > > +   means that CONFIG_MMU is not set, 'numa' means CONFIG_NUMA and 'numab' means
> > > +   CONFIG_NUMA_BALANCING'.
> > > +
> > > +   In the write lock column '(Static)' means that the field is set only once
> > > +   upon initialisation of the VMA and not changed after this, the VMA would
> > > +   either have been under an mmap write and VMA write lock at the time or not
> > > +   yet inserted into any tree.
> > > +
> > > +Page table locks
> > > +----------------
> > > +
> > > +When allocating a P4D, PUD or PMD and setting the relevant entry in the above
> > > +PGD, P4D or PUD, the `mm->page_table_lock` is acquired to do so. This is
> > > +acquired in `__p4d_alloc()`, `__pud_alloc()` and `__pmd_alloc()` respectively.
> > > +
> > > +.. note::
> > > +   `__pmd_alloc()` actually invokes `pud_lock()` and `pud_lockptr()` in turn,
> > > +   however at the time of writing it ultimately references the
> > > +   `mm->page_table_lock`.
> > > +
> > > +Allocating a PTE will either use the `mm->page_table_lock` or, if
> > > +`USE_SPLIT_PMD_PTLOCKS` is defined, use a lock embedded in the PMD physical
> > > +page metadata in the form of a `struct ptdesc`, acquired by `pmd_ptdesc()`
> > > +called from `pmd_lock()` and ultimately `__pte_alloc()`.
> > > +
> > > +Finally, modifying the contents of the PTE has special treatment, as this is a
> > > +lock that we must acquire whenever we want stable and exclusive access to
> > > +entries pointing to data pages within a PTE, especially when we wish to modify
> > > +them.
> > > +
> > > +This is performed via `pte_offset_map_lock()` which carefully checks to ensure
> > > +that the PTE hasn't changed from under us, ultimately invoking `pte_lockptr()`
> > > +to obtain a spin lock at PTE granularity contained within the `struct ptdesc`
> > > +associated with the physical PTE page. The lock must be released via
> > > +`pte_unmap_unlock()`.
> > > +
> > > +.. note::
> > > +   There are some variants on this, such as `pte_offset_map_rw_nolock()` when we
> > > +   know we hold the PTE stable but for brevity we do not explore this.
> > > +   See the comment for `__pte_offset_map_lock()` for more details.
> > > +
> > > +When modifying data in ranges we typically only wish to allocate higher page
> > > +tables as necessary, using these locks to avoid races or overwriting anything,
> > > +and set/clear data at the PTE level as required (for instance when page faulting
> > > +or zapping).
> > > +
> > > +Page table teardown
> > > +-------------------
> > > +
> > > +Tearing down page tables themselves is something that requires significant
> > > +care. There must be no way that page tables designated for removal can be
> > > +traversed or referenced by concurrent tasks.
> > > +
> > > +It is insufficient to simply hold an mmap write lock and VMA lock (which will
> > > +prevent racing faults, and rmap operations), as a file-backed mapping can be
> > > +truncated under the `struct address_space` i_mmap_lock alone.
> > > +
> > > +As a result, no VMA which can be accessed via the reverse mapping (either
> > > +anon_vma or the `struct address_space->i_mmap` interval tree) can have its page
> > > +tables torn down.
> > > +
> > > +The operation is typically performed via `free_pgtables()`, which assumes either
> > > +the mmap write lock has been taken (as specified by its `mm_wr_locked`
> > > +parameter), or that the VMA is fully detached.
> > > +
> > > +It carefully removes the VMA from all reverse mappings, however it's important
> > > +that no new ones overlap these and that no route remains to permit access to
> > > +addresses within the range whose page tables are being torn down.
> > > +
> > > +As a result of these careful conditions, note that page table entries are
> > > +cleared without page table locks, as it is assumed that all of these precautions
> > > +have already been taken.
> > > +
> > > +mmap write lock downgrading
> > > +---------------------------
> > > +
> > > +While it is possible to obtain an mmap write or read lock using the
> > > +`mm->mmap_lock` read/write semaphore, it is also possible to **downgrade** from
> > > +a write lock to a read lock via `mmap_write_downgrade()`.
> > > +
> > > +Similar to `mmap_write_unlock()`, this implicitly terminates all VMA write locks
> > > +via `vma_end_write_all()` (more on this behaviour in the VMA lock internals
> > > +section below), but importantly does not relinquish the mmap lock while
> > > +downgrading, therefore keeping the locked virtual address space stable.
> > > +
> > > +A subtlety here is that callers can assume, if they invoke an
> > > +mmap_write_downgrade() operation, that they still have exclusive access to the
> > > +virtual address space (excluding VMA read lock holders), as for another task to
> > > +have downgraded they would have had to have exclusive access to the semaphore
> > > +which can't be the case until the current task completes what it is doing.
> >
> > I can't decipher the above paragraph. Could you please dumb it down
> > for the likes of me?
>
> Since you're smarter than me this indicates I am not being clear here :)
> Actually reading this again I've not expressed this correctly.
>
> This is something Jann mentioned, that I hadn't thought of before.
>
> So if you have an mmap write lock, you have exclusive access to the mmap
> (with the usual caveats about racing vma locks unless you vma write lock).
>
> When you downgrade you now have a read lock - but because you were
> exclusive earlier in the function AND any new caller of the function will
> have to acquire that same write lock FIRST, they all have to wait on you
> and therefore you have exclusive access to the mmap while holding only a read lock.
>
> So you are actually guaranteed that nobody else can be racing you _in that
> function_, and equally no other writers can arise until you're done as your
> holding the read lock prevents that.

I guess you could simplify this description by saying that downgrading
a write-lock to read-lock still guarantees that there are no writers
until you drop that read-lock.

>
> Jann - correct me if I'm wrong or missing something here.
>
> Will correct this unless Jann tells me I'm missing something on this :)
>
> >
> > > +
> > > +Stack expansion
> > > +---------------
> > > +
> > > +Stack expansion throws up additional complexities in that we cannot permit there
> > > +to be racing page faults, as a result we invoke `vma_start_write()` to prevent
> > > +this in `expand_downwards()` or `expand_upwards()`.
> > > +
> > > +Lock ordering
> > > +-------------
> > > +
> > > +As we have multiple locks across the kernel which may or may not be taken at the
> > > +same time as explicit mm or VMA locks, we have to be wary of lock inversion, and
> > > +the **order** in which locks are acquired and released becomes very important.
> > > +
> > > +.. note::
> > > +
> > > +   Lock inversion occurs when two threads need to acquire multiple locks,
> > > +   but in doing so inadvertently cause a mutual deadlock.
> > > +
> > > +   For example, consider thread 1 which holds lock A and tries to acquire lock B,
> > > +   while thread 2 holds lock B and tries to acquire lock A.
> > > +
> > > +   Both threads are now deadlocked on each other. However, had they attempted to
> > > +   acquire locks in the same order, one would have waited for the other to
> > > +   complete its work and no deadlock would have occurred.
> > > +
> > > +The opening comment in `mm/rmap.c` describes in detail the required ordering of
> > > +locks within memory management code:
> > > +
> > > +.. code-block::
> > > +
> > > +  inode->i_rwsem       (while writing or truncating, not reading or faulting)
> > > +    mm->mmap_lock
> > > +      mapping->invalidate_lock (in filemap_fault)
> > > +        folio_lock
> > > +          hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
> > > +            vma_start_write
> > > +              mapping->i_mmap_rwsem
> > > +                anon_vma->rwsem
> > > +                  mm->page_table_lock or pte_lock
> > > +                    swap_lock (in swap_duplicate, swap_info_get)
> > > +                      mmlist_lock (in mmput, drain_mmlist and others)
> > > +                      mapping->private_lock (in block_dirty_folio)
> > > +                          i_pages lock (widely used)
> > > +                            lruvec->lru_lock (in folio_lruvec_lock_irq)
> > > +                      inode->i_lock (in set_page_dirty's __mark_inode_dirty)
> > > +                      bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
> > > +                        sb_lock (within inode_lock in fs/fs-writeback.c)
> > > +                        i_pages lock (widely used, in set_page_dirty,
> > > +                                  in arch-dependent flush_dcache_mmap_lock,
> > > +                                  within bdi.wb->list_lock in __sync_single_inode)
> > > +
> > > +Please check the current state of this comment which may have changed since the
> > > +time of writing of this document.
> > > +
> > > +VMA lock internals
> > > +------------------
> > > +
> > > +The VMA lock mechanism is designed to be a lightweight means of avoiding the use
> > > +of the heavily contended mmap lock. It is implemented using a combination of a
> > > +read/write semaphore and sequence numbers belonging to the containing `struct
> > > +mm_struct` and the VMA.
> > > +
> > > +Read locks are acquired via `vma_start_read()`, which is an optimistic
> > > +operation, i.e. it tries to acquire a read lock but returns false if it is
> > > +unable to do so. At the end of the read operation, `vma_end_read()` is called to
> > > +release the VMA read lock. This can be done under RCU alone.
> > > +
> > > +Writing requires the mmap to be write-locked and the VMA lock to be acquired via
> > > +`vma_start_write()`, however the write lock is released by the termination or
> > > +downgrade of the mmap write lock so no `vma_end_write()` is required.
> > > +
> > > +All this is achieved by the use of per-mm and per-VMA sequence counts. This is
> > > +used to reduce complexity, and potential especially around operations which
> >
> > potential?
>
> Yeah sorry this sentence is completely mangled, will fix!
>
> >
> > > +write-lock multiple VMAs at once.
> > > +
> > > +If the mm sequence count, `mm->mm_lock_seq`, is equal to the VMA sequence count,
> > > +`vma->vm_lock_seq`, then the VMA is write-locked. If they differ, then it is
> > > +not.
> > > +
> > > +Each time an mmap write lock is acquired in `mmap_write_lock()`,
> > > +`mmap_write_lock_nested()`, `mmap_write_lock_killable()`, the `mm->mm_lock_seq`
> > > +sequence number is incremented via `mm_lock_seqcount_begin()`.
> > > +
> > > +Each time the mmap write lock is released in `mmap_write_unlock()` or
> > > +`mmap_write_downgrade()`, `vma_end_write_all()` is invoked which also increments
> > > +`mm->mm_lock_seq` via `mm_lock_seqcount_end()`.
> > > +
> > > +This way, we ensure, regardless of the VMA's sequence number, that a write
> > > +lock is not incorrectly indicated (since we increment the sequence counter on
> > > +acquiring the mmap write lock, which is required in order to obtain a VMA write
> > > +lock), and that when we release an mmap write lock, we efficiently release
> > > +**all** VMA write locks contained within the mmap at the same time.
> >
> > Ok, I see that you describe some of the rules I mentioned before here.
> > Up to you where to place them.
>
> Yeah may rearrange a little in general to clear things up a bit.
>
> I wanted a bit on the internals here, but then I end up mentioning so much
> of this above that maybe it's a bit duplicative... let's see how I do on
> the respin :)
>
> >
> > > +
> > > +The exclusivity of the mmap write lock ensures this is what we want, as there
> > > +would never be a reason to persist per-VMA write locks across multiple mmap
> > > +write lock acquisitions.
> > > +
> > > +Each time a VMA read lock is acquired, we acquire a read lock on the
> > > +`vma->vm_lock` read/write semaphore and hold it, while checking that the
> > > +sequence count of the VMA does not match that of the mm.
> > > +
> > > +If it does, the read lock fails. If it does not, we hold the lock, excluding
> > > +writers, but permitting other readers, who will also obtain this lock under RCU.
> > > +
> > > +Importantly, maple tree operations performed in `lock_vma_under_rcu()` are also
> > > +RCU safe, so the whole read lock operation is guaranteed to function correctly.
> > > +
> > > +On the write side, we acquire a write lock on the `vma->vm_lock` read/write
> > > +semaphore, before setting the VMA's sequence number under this lock, also
> > > +simultaneously holding the mmap write lock.
> > > +
> > > +This way, if any read locks are in effect, `vma_start_write()` will sleep until
> > > +these are finished and mutual exclusion is achieved.
> > > +
> > > +After setting the VMA's sequence number, the lock is released, avoiding
> > > +complexity with a long-term held write lock.
> > > +
> > > +This clever combination of a read/write semaphore and sequence count allows for
> > > +fast RCU-based per-VMA lock acquisition (especially on x86-64 page fault, though
> > > +utilised elsewhere) with minimal complexity around lock ordering.
> > > --
> > > 2.47.0
> > >
>
> Thanks for the review! Will try to clarify and tighten things up generally
> on top of the points you and the other reviewers have raised here and
> hopefully v2 should be nice and sharp!

Sounds good! I think it would be good to separate design decisions (mm
should be write-locked before any VMA can be write-locked, all
write-locked VMAs are automatically unlocked once mm is
write-unlocked, etc) vs implementation details (lock_vma_under_rcu(),
vma->vm_lock_seq, mm->mm_lock_seq, etc). Easier said than done :)
Jann Horn Nov. 4, 2024, 9:25 p.m. UTC | #17
On Mon, Nov 4, 2024 at 10:04 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> On Mon, Nov 04, 2024 at 09:01:46AM -0800, Suren Baghdasaryan wrote:
> > On Fri, Nov 1, 2024 at 11:51 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > > +MM and VMA locks
> > > +----------------
> > > +
> > > +There are two key classes of lock utilised when reading and manipulating VMAs -
> > > +the `mmap_lock` which is a read/write semaphore maintained at the `mm_struct`
> > > +level of granularity and, if CONFIG_PER_VMA_LOCK is set, a per-VMA lock at the
> > > +VMA level of granularity.
> > > +
> > > +.. note::
> > > +
> > > +   Generally speaking, a read/write semaphore is a class of lock which permits
> > > +   concurrent readers. However a write lock can only be obtained once all
> > > +   readers have left the critical region (and pending readers made to wait).
> > > +
> > > +   This renders read locks on a read/write semaphore concurrent with other
> > > +   readers and write locks exclusive against all others holding the semaphore.
> > > +
> > > +If CONFIG_PER_VMA_LOCK is not set, then things are relatively simple - a write
> > > +mmap lock gives you exclusive write access to a VMA, and a read lock gives you
> > > +concurrent read-only access.
> > > +
> > > +In the presence of CONFIG_PER_VMA_LOCK, i.e. VMA locks, things are more
> > > +complicated. In this instance, a write semaphore is no longer enough to gain
> > > +exclusive access to a VMA, a VMA write lock is also required.
> >
> > I think "exclusive access to a VMA" should be "exclusive access to mm"
> > if you are talking about mmap_lock.
>
> Right, but in the past an mm write lock was sufficient to gain exclusive
> access to a _vma_. I will adjust to say 'write semaphore on the mm'.

We might want to introduce some explicit terminology for talking about
types of locks in MM at some point in this document. Like:

 - "high-level locks" (or "metadata locks"?) means mmap lock, VMA
lock, address_space lock, anon_vma lock

 - "pagetable-level locks" means page_table_lock and PMD/PTE spinlocks

 - "write-locked VMA" means mmap lock is held for writing and VMA has
been marked as write-lock

 - "rmap locks" means the address_space and anon_vma locks
   - "holding the rmap locks for writing" means holding both (if applicable)
   - "holding an rmap lock for reading" means holding one of them

 - "read-locked VMA" means either mmap lock held for reading or VMA
lock held for reading

That might make it a bit easier to write concise descriptions of
locking requirements in the rest of this document and keep them

> > > +The VMA lock is implemented via the use of both a read/write semaphore and
> > > +per-VMA and per-mm sequence numbers. We go into detail on this in the VMA lock
> > > +internals section below, so for the time being it is important only to note that
> > > +we can obtain either a VMA read or write lock.
> > > +
> > > +.. note::
> > > +
> > > +   VMAs under VMA **read** lock are obtained by the `lock_vma_under_rcu()`
> > > +   function, and **no** existing mmap or VMA lock must be held. This function
> >
> > "no existing mmap or VMA lock must be held" did you mean to say "no
> > exclusive mmap or VMA locks must be held"? Because one can certainly
> > hold a read-lock on them.
>
> Hmm really? You can hold an mmap read lock and obtain a VMA read lock too
> irrespective of that?

I think you can call lock_vma_under_rcu() while already holding the
mmap read lock, but only because lock_vma_under_rcu() has trylock
semantics. (The other way around leads to a deadlock: You can't take
the mmap read lock while holding a VMA read lock, because the VMA read
lock may prevent another task from write-locking a VMA after it has
already taken an mmap write lock.)
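
So a fault-path caller ends up looking roughly like this sketch (heavily
simplified from what the arch fault handlers actually do - ignoring retries,
access checks, and the fact that handle_mm_fault() can itself drop the mmap
lock):

    static void fault_sketch(struct mm_struct *mm, unsigned long addr,
                             unsigned int flags, struct pt_regs *regs)
    {
            struct vm_area_struct *vma;
            vm_fault_t fault;

            vma = lock_vma_under_rcu(mm, addr);     /* trylock - may be NULL */
            if (vma) {
                    fault = handle_mm_fault(vma, addr,
                                            flags | FAULT_FLAG_VMA_LOCK, regs);
                    if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
                            vma_end_read(vma);
                    return;
            }

            /* Lock contended or a write in progress - fall back. */
            mmap_read_lock(mm);
            vma = find_vma(mm, addr);
            if (vma)
                    handle_mm_fault(vma, addr, flags, regs);
            mmap_read_unlock(mm);
    }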

> > > +mmap write lock downgrading
> > > +---------------------------
> > > +
> > > +While it is possible to obtain an mmap write or read lock using the
> > > +`mm->mmap_lock` read/write semaphore, it is also possible to **downgrade** from
> > > +a write lock to a read lock via `mmap_write_downgrade()`.
> > > +
> > > +Similar to `mmap_write_unlock()`, this implicitly terminates all VMA write locks
> > > +via `vma_end_write_all()` (more on this behaviour in the VMA lock internals
> > > +section below), but importantly does not relinquish the mmap lock while
> > > +downgrading, therefore keeping the locked virtual address space stable.
> > > +
> > > +A subtlety here is that callers can assume, if they invoke an
> > > +mmap_write_downgrade() operation, that they still have exclusive access to the
> > > +virtual address space (excluding VMA read lock holders), as for another task to
> > > +have downgraded they would have had to have exclusive access to the semaphore
> > > +which can't be the case until the current task completes what it is doing.
> >
> > I can't decipher the above paragraph. Could you please dumb it down
> > for the likes of me?
>
> Since you're smarter than me this indicates I am not being clear here :)
> Actually reading this again I've not expressed this correctly.
>
> This is something Jann mentioned, that I hadn't thought of before.
>
> So if you have an mmap write lock, you have exclusive access to the mmap
> (with the usual caveats about racing vma locks unless you vma write lock).
>
> When you downgrade you now have a read lock - but because you were
> exclusive earlier in the function AND any new caller of the function will
> have to acquire that same write lock FIRST, they all have to wait on you
> and therefore you have exclusive access to the mmap only with a read lock.
>
> So you are actually guaranteed that nobody else can be racing you _in that
> function_, and equally no other writers can arise until you're done as your
> holding the read lock prevents that.
>
> Jann - correct me if I'm wrong or missing something here.
>
> Will correct this unless Jann tells me I'm missing something on this :)

Yeah, basically you can hold an rwsem in three modes:

 - reader (R)
 - reader that results from downgrading a writer (D)
 - writer (W)

and this is the diagram of which excludes which (view it in monospace,
✔ means mutually exclusive):

  | R | D | W
==|===|===|===
R | ✘ | ✘ | ✔
--|---|---|---
D | ✘ | ✔ | ✔
--|---|---|---
W | ✔ | ✔ | ✔

So the special thing about downgraded-readers compared to normal
readers is that they exclude other downgraded-readers.
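
In code, the pattern being described looks roughly like this (a sketch only;
real paths also deal with detached VMAs, accounting and so on):

    mmap_write_lock(mm);
    /* ... modify the address space: detach/split/merge VMAs ... */
    mmap_write_downgrade(mm);       /* now a (downgraded) read lock */

    /*
     * Plain readers can run concurrently from here on, but no new writer -
     * and therefore no other downgraded reader - can get in until we drop
     * the lock below.
     */
    /* ... slower work that only needs the address space to stay stable ... */
    mmap_read_unlock(mm);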
Jann Horn Nov. 4, 2024, 9:29 p.m. UTC | #18
On Mon, Nov 4, 2024 at 5:42 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> On Sat, Nov 02, 2024 at 02:45:35AM +0100, Jann Horn wrote:
> > On Fri, Nov 1, 2024 at 7:50 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > > Locking around VMAs is complicated and confusing. While we have a number of
> > > disparate comments scattered around the place, we seem to be reaching a
> > > level of complexity that justifies a serious effort at clearly documenting
> > > how locks are expected to be interacted with when it comes to interacting
> > > with mm_struct and vm_area_struct objects.
> > >
> > > This is especially pertinent as regards efforts to find sensible
> > > abstractions for these fundamental objects within the kernel rust
> > > abstraction whose compiler strictly requires some means of expressing these
> > > rules (and through this expression can help self-document these
> > > requirements as well as enforce them which is an exciting concept).
> > >
> > > The document limits scope to mmap and VMA locks and those that are
> > > immediately adjacent and relevant to them - so additionally covers page
> > > table locking as this is so very closely tied to VMA operations (and relies
> > > upon us handling these correctly).
> > >
> > > The document tries to cover some of the nastier and more confusing edge
> > > cases and concerns especially around lock ordering and page table teardown.
> > >
> > > The document also provides some VMA lock internals, which are up to date
> > > and inclusive of recent changes to recent sequence number changes.

> > > +Overview
> > > +--------
> > > +
> > > +Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
> > > +'VMA's of type `struct vm_area_struct`.
> > > +
> > > +Each VMA describes a virtually contiguous memory range with identical
> > > +attributes, each of which described by a `struct vm_area_struct`
> > > +object. Userland access outside of VMAs is invalid except in the case where an
> > > +adjacent stack VMA could be extended to contain the accessed address.
> > > +
> > > +All VMAs are contained within one and only one virtual address space, described
> > > +by a `struct mm_struct` object which is referenced by all tasks (that is,
> > > +threads) which share the virtual address space. We refer to this as the `mm`.
> > > +
> > > +Each mm object contains a maple tree data structure which describes all VMAs
> > > +within the virtual address space.
> >
> > The gate VMA is special, on architectures that have it: Userland
> > access to its area is allowed, but the area is outside the VA range
> > managed by the normal MM code, and the gate VMA is a global object
> > (not per-MM), and only a few places in MM code can interact with it
> > (for example, page fault handling can't, but GUP can through
> > get_gate_page()).
> >
> > (I think this also has the fun consequence that vm_normal_page() can
> > get called on a VMA whose ->vm_mm is NULL, when called from
> > get_gate_page().)
>
> Yeah the gate page is weird, I'm not sure it's worth going into too much detail
> here, but perhaps a note explaining in effect 'except for the gate page..'
> unless you think it'd be valuable to go into that in more detail than a passing
> 'hey of course there's an exception to this!' comment? :)

Yeah I think that's good enough.

> > > +The kernel is designed to be highly scalable against concurrent access to
> > > +userland memory, so a complicated set of locks are required to ensure no data
> > > +races or memory corruption occurs.
> > > +
> > > +This document explores this locking in detail.
> > > +
> > > +.. note::
> > > +
> > > +   There are three different things that a user might want to achieve via
> > > +   locks - the first of which is **stability**. That is - ensuring that the VMA
> > > +   won't be freed or modified in any way from underneath us.
> > > +
> > > +   All MM and VMA locks ensure stability.
> > > +
> > > +   Secondly we have locks which allow **reads** but not writes (and which might
> > > +   be held concurrent with other CPUs who also hold the read lock).

(maybe also note more clearly here that "read" is talking about the
VMA metadata, so an operation that writes page table entries normally
counts as "read")

> > > +   Finally, we have locks which permit exclusive access to the VMA to allow for
> > > +   **writes** to the VMA.
> >
> > Maybe also mention that there are three major paths you can follow to
> > reach a VMA? You can come through the mm's VMA tree, you can do an
> > anon page rmap walk that goes page -> anon_vma -> vma, or you can do a
> > file rmap walk from the address_space. Which is why just holding the
> > mmap lock and vma lock in write mode is not enough to permit arbitrary
> > changes to a VMA struct.
>
> I totally agree that adding something about _where_ you can come from is a good
> idea, will do.
>
> However, in terms of the VMA itself, mmap lock and vma lock _are_ sufficient to
> prevent arbitrary _changes_ to the VMA struct right?

Yes. But the sentence "Finally, we have locks which permit exclusive
access to the VMA to allow for **writes** to the VMA" kinda sounds as
if there is a single lock you can take that allows you to write to the
VMA struct.

> It isn't sufficient to prevent _reading_ of vma metadata fields, nor walking of
> underlying page tables, so if you're going to do something that changes
> fundamentals you need to hide it from rmap.
>
> Maybe worth going over relevant fields? Or rather adding an additional 'read
> lock' column?
>
> vma->vm_mm ('static' anyway after VMA created)
> vma->vm_start (change on merge/split)

and on stack expansion :P
But I guess nowadays that's basically semantically just a special case
of merge, so no need to explicitly mention it here...

> vma->vm_end (change on merge/split)
> vma->vm_flags (can change)
> vma->vm_ops ('static' anyway after call_mmap())
>
> In any case this is absolutely _crucial_ I agree, will add.
>
> >
> > > +MM and VMA locks
> > > +----------------
> > > +
> > > +There are two key classes of lock utilised when reading and manipulating VMAs -
> > > +the `mmap_lock` which is a read/write semaphore maintained at the `mm_struct`
> > > +level of granularity and, if CONFIG_PER_VMA_LOCK is set, a per-VMA lock at the
> > > +VMA level of granularity.
> > > +
> > > +.. note::
> > > +
> > > +   Generally speaking, a read/write semaphore is a class of lock which permits
> > > +   concurrent readers. However a write lock can only be obtained once all
> > > +   readers have left the critical region (and pending readers made to wait).
> > > +
> > > +   This renders read locks on a read/write semaphore concurrent with other
> > > +   readers and write locks exclusive against all others holding the semaphore.
> > > +
> > > +If CONFIG_PER_VMA_LOCK is not set, then things are relatively simple - a write
> > > +mmap lock gives you exclusive write access to a VMA, and a read lock gives you
> > > +concurrent read-only access.
> > > +
> > > +In the presence of CONFIG_PER_VMA_LOCK, i.e. VMA locks, things are more
> > > +complicated. In this instance, a write semaphore is no longer enough to gain
> > > +exclusive access to a VMA, a VMA write lock is also required.
> > > +
> > > +The VMA lock is implemented via the use of both a read/write semaphore and
> > > +per-VMA and per-mm sequence numbers. We go into detail on this in the VMA lock
> > > +internals section below, so for the time being it is important only to note that
> > > +we can obtain either a VMA read or write lock.
> > > +
> > > +.. note::
> > > +
> > > +   VMAs under VMA **read** lock are obtained by the `lock_vma_under_rcu()`
> > > > +   function, and **no** existing mmap or VMA lock must be held. This function
> >
> > uffd_move_lock() calls lock_vma_under_rcu() after having already
> > VMA-locked another VMA with uffd_lock_vma().
>
> Oh uffd, how we love you...
>
> I think it might be worth adding a note for this exception. Obviously they do
> some pretty careful manipulation to avoid issues here so probably worth saying
> 'hey except uffd'

I guess another way to write it would be something like:

"Taking the mmap lock in read mode while you're holding a vma lock is
forbidden because it can deadlock. Calling lock_vma_under_rcu()
normally only makes sense when you're not holding the mmap lock
(otherwise it would be redundant). lock_vma_under_rcu() has trylock
semantics, and if it fails you need a plan B (which normally is to
take the mmap lock in read mode instead; notably this would get more
annoying if you were already holding another VMA lock, because then
you'd have to drop that first)."?

> > > +   lock, page faults can race against you even if you hold an mmap write lock.
> > > +
> > > +VMA Fields
> > > +----------
> > > +
> > > +We examine each field of the `struct vm_area_struct` type in detail in the table
> > > +below.
> > > +
> > > +Reading of each field requires either an mmap read lock or a VMA read lock to be
> > > +held, except where 'unstable RCU read' is specified, in which case unstable
> > > +access to the field is permitted under RCU alone.
> > > +
> > > +The table specifies which write locks must be held to write to the field.
> >
> > vm_start, vm_end and vm_pgoff also require that the associated
> > address_space and anon_vma (if applicable) are write-locked, and that
> > their rbtrees are updated as needed.
>
> Surely vm_flags too...

Nah, there are a bunch of madvise() operations that change vm_flags,
and at least the simple ones don't touch rmap locks (I don't know if
maybe any of the more complex ones do). See MADV_DONTFORK, for example
- we basically just take the mmap lock in write mode, write-lock the
VMA, and overwrite the flags.

Not even do_mprotect_pkey() takes rmap locks! Just takes the mmap lock
in write mode, write-locks the VMA, changes the VM flags, and then
fixes up all the existing PTEs.
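
A minimal sketch of that "simple" case, loosely modelled on the MADV_DONTFORK
example above ('mm' and 'addr' are assumed; real code goes through the madvise
machinery and proper error handling):

    struct vm_area_struct *vma;

    mmap_write_lock(mm);
    vma = vma_lookup(mm, addr);
    if (vma) {
            vma_start_write(vma);            /* per-VMA write lock */
            vm_flags_set(vma, VM_DONTCOPY);  /* what MADV_DONTFORK sets; in
                                              * current kernels vm_flags_set()
                                              * also write-locks the VMA itself,
                                              * the explicit call above just
                                              * makes the rule visible */
    }
    mmap_write_unlock(mm);                   /* drops the mmap lock and ends
                                              * all VMA write locks for this mm */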

> > > +     -
> > > +     - mmap write, VMA write
> > > +   * - vm_file
> > > +     -
> > > +     - If the VMA is file-backed, points to a `struct file` object describing
> > > +       the underlying file, if anonymous then `NULL`.
> > > +     -
> > > +     - (Static)
> > > +   * - vm_private_data
> > > +     -
> > > +     - A `void *` field for driver-specific metadata.
> > > +     -
> > > +     - Driver-mandated.
> > > +   * - anon_name
> > > +     - anon name
> > > +     - A field for storing a `struct anon_vma_name` object providing a name for
> > > +       anonymous mappings, or `NULL` if none is set or the VMA is file-backed.
> > > +     -
> > > +     - mmap write, VMA write
> > > +   * - swap_readahead_info
> > > +     - swap
> > > +     - Metadata used by the swap mechanism to perform readahead.
> > > +     -
> > > +     - mmap read
> > > +   * - vm_region
> > > +     - nommu
> > > +     - The containing region for the VMA for architectures which do not
> > > +       possess an MMU.
> > > +     - N/A
> > > +     - N/A
> > > +   * - vm_policy
> > > +     - numa
> > > +     - `mempolicy` object which describes NUMA behaviour of the VMA.
> > > +     -
> > > +     - mmap write, VMA write
> > > +   * - numab_state
> > > +     - numab
> > > +     - `vma_numab_state` object which describes the current state of NUMA
> > > +       balancing in relation to this VMA.
> > > +     -
> > > +     - mmap write, VMA write
> >
> > I think task_numa_work() is only holding the mmap lock in read mode
> > when it sets this pointer to a non-NULL value.
>
> ugh lord... knew I'd get at least one of these wrong :P

to be fair I think task_numa_work() looks kinda dodgy ^^ I remember
spending quite a bit of time staring at it at one point (my very
sparse notes suggest I was looking in that area because I was
surprised that change_protection() can run with the mmap lock only
read-locked for some NUMA hinting fault stuff); I don't remember
whether I concluded that the ->vma_numab_state locking in
task_numa_work() is fine or just not overly terrible...

> > > +   * - vm_userfaultfd_ctx
> > > +     -
> > > +     - Userfaultfd context wrapper object of type `vm_userfaultfd_ctx`, either
> > > +       of zero size if userfaultfd is disabled, or containing a pointer to an
> > > +       underlying `userfaultfd_ctx` object which describes userfaultfd metadata.
> > > +     -
> > > +     - mmap write, VMA write
> > > +
> > > +.. note::
> > > +
> > > +   In the config column 'vma lock' configuration means CONFIG_PER_VMA_LOCK,
> > > +   'anon name' means CONFIG_ANON_VMA_NAME, 'swap' means CONFIG_SWAP, 'nommu'
> > > +   means that CONFIG_MMU is not set, 'numa' means CONFIG_NUMA and 'numab' means
> > > +   CONFIG_NUMA_BALANCING'.
> > > +
> > > +   In the write lock column '(Static)' means that the field is set only once
> > > +   upon initialisation of the VMA and not changed after this, the VMA would
> > > +   either have been under an mmap write and VMA write lock at the time or not
> > > +   yet inserted into any tree.
> > > +
> > > +Page table locks
> > > +----------------
> > > +
> > > +When allocating a P4D, PUD or PMD and setting the relevant entry in the above
> > > +PGD, P4D or PUD, the `mm->page_table_lock` is acquired to do so. This is
> > > +acquired in `__p4d_alloc()`, `__pud_alloc()` and `__pmd_alloc()` respectively.
> > > +
> > > +.. note::
> > > +   `__pmd_alloc()` actually invokes `pud_lock()` and `pud_lockptr()` in turn,
> > > +   however at the time of writing it ultimately references the
> > > +   `mm->page_table_lock`.
> > > +
> > > +Allocating a PTE will either use the `mm->page_table_lock` or, if
> > > +`USE_SPLIT_PMD_PTLOCKS` is defined, use a lock embedded in the PMD physical
> > > +page metadata in the form of a `struct ptdesc`, acquired by `pmd_ptdesc()`
> > > +called from `pmd_lock()` and ultimately `__pte_alloc()`.
> > > +
> > > +Finally, modifying the contents of the PTE has special treatment, as this is a
> > > +lock that we must acquire whenever we want stable and exclusive access to
> > > +entries pointing to data pages within a PTE, especially when we wish to modify
> > > +them.
> >
> > I guess one other perspective on this would be to focus on the
> > circumstances under which you're allowed to write entries:
> >
> > 0. page tables can be concurrently read by hardware and GUP-fast, so
> > writes must always be appropriately atomic
>
> Yeah I definitely need to mention GUP-fast considerations (and consequently
> the pXX_lockless..() functions). Thanks for raising that,  important one.
>
> > 1. changing a page table entry always requires locking the containing
> > page table (except when the write is an A/D update by hardware)
>
> I think we can ignore the hardware writes altogether, though I think worth
> adding a 'note' to explain this can happen outside of this framework
> altogether.

I think it's important to know about the existence of hardware writes
because it means you need atomic operations when making changes to
page tables. Like, for example, in many cases when changing a present
PTE, you can't even use READ_ONCE()/WRITE_ONCE() for PTEs and need
atomic RMW operations instead - see for example ptep_get_and_clear(),
which is basically implemented in arch code as an atomic xchg so that
it can't miss concurrent A/D bit updates.

(The non-SMP version of that on X86 doesn't use atomics, I have no
idea if that is actually correct or just mostly-working. Either way, I
guess the !SMP build doesn't matter very much nowadays.)
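
To make the point concrete, a sketch of clearing a present PTE (assuming the
PTE table is already mapped and locked via 'ptep', and that 'folio' has already
been resolved; TLB flushing and further folio bookkeeping are elided):

    pte_t old;

    old = ptep_get_and_clear(mm, addr, ptep);  /* atomic xchg on SMP, so a
                                                * concurrent hardware A/D bit
                                                * update cannot be lost */
    if (pte_dirty(old))
            folio_mark_dirty(folio);
    if (pte_young(old))
            folio_mark_accessed(folio);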

> > 2. in page tables higher than PMD level, page table entries that point
> > to page tables can only be changed to point to something else when
> > holding all the relevant high-level locks leading to the VMA in
> > exclusive mode: mmap lock (unless the VMA is detached), VMA lock,
> > anon_vma, address_space
>
> Right this seems mremap()-specific when you say 'change' here :) and of
> course, we have code that explicitly does this (take_rmap_locks() +
> drop_rmap_locks()).

munmap and mremap, yes. Basically what I'm trying to express with this
is "as a reader, you can assume that higher page tables are stable
just by having some kind of read lock on the VMA or its rmaps".

(IIRC back when this was the rule for all page table levels,
khugepaged used to do this too, write-locking both the rmap and the
mm.)
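
From the reader side this is why the walk down to the PMD can be lock-free as
long as the VMA (or an rmap) is read-locked; only the PMD -> PTE-table step
needs page table locking. A sketch ('mm' and 'addr' assumed, none/huge-entry
handling mostly elided):

    pgd_t *pgd;
    p4d_t *p4d;
    pud_t *pud;
    pmd_t *pmd;

    pgd = pgd_offset(mm, addr);
    if (pgd_none_or_clear_bad(pgd))
            return;
    p4d = p4d_offset(pgd, addr);
    if (p4d_none_or_clear_bad(p4d))
            return;
    pud = pud_offset(p4d, addr);
    if (pud_none_or_clear_bad(pud))
            return;
    pmd = pmd_offset(pud, addr);
    /* the PMD -> PTE-table step is the one that needs pte_offset_map_lock(),
     * because of rule 3 / khugepaged - see further down */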

> > 3. PMD entries that point to page tables can be changed while holding
> > the page table spinlocks for the entry and the table it points to
>
> Hm wut? When you say 'entry' what do you mean? Obviously a page table in

By "PMD entry" I mean a pmd_t (located in a Page Middle Directory),
and by "that point to page tables" I mean "that point to a PTE-level
page table".

In other words, from the reader perspective (as I think you already
mentioned further down):

Rule 2 means: From the perspective of a reader who is holding the VMA
lock in read mode, once you have seen that e.g. a PUD entry
(overlapping the VMA's virtual address region) points to a PMD page
table, you know that this PUD entry will keep pointing to that PMD
table.

Rule 3 means: From the perspective of a reader who is holding the VMA
lock in read mode, once you have seen that a PMD entry (overlapping
the VMA's virtual address region) points to a page table, you don't
know whether this PMD entry will keep pointing to the same page table
unless you're also holding a spinlock on either the PMD or the page
table (because khugepaged).

> theory could be changed at any point you don't have it locked and to be
> sure it hasn't you have to lock + check again.


> > 5. entries in "none" state can only be populated with leaf entries
> > while holding the mmap or vma lock (doing it through the rmap would be
> > bad because that could race with munmap() zapping data pages in the
> > region)
> > 6. leaf entries can be zapped (changed to "none") while holding any
> > one of mmap lock, vma lock, address_space lock, or anon_vma lock
>
> For both 5 and 6 - I'm not sure if we ever zap without holding the mmap
> lock do we?
>
> Unless you're including folio_mkclean() and pfn_mkclean_range()? I guess
> this is 'strike of the linux kernel terminology' once again :P
>
> Yeah in that case sure.

There are a bunch of paths that zap without taking the mmap lock, the
easiest to reach is probably the ftruncate() syscall:

do_sys_ftruncate -> do_ftruncate -> do_truncate -> notify_change ->
simple_setattr -> truncate_setsize -> truncate_pagecache ->
unmap_mapping_range -> unmap_mapping_pages -> unmap_mapping_range_tree
-> {loop over file rmap tree} -> unmap_mapping_range_vma ->
zap_page_range_single

GPU drivers and such do it too, search for "unmap_mapping_range".
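
For reference, from the filesystem/driver side such a path looks roughly like
the sketch below ('inode', 'pos' and 'len' are assumed to be in hand; the exact
rounding is illustrative):

    loff_t holebegin = round_down(pos, PAGE_SIZE);
    loff_t holelen   = round_up(pos + len, PAGE_SIZE) - holebegin;

    /* tears down user PTEs for the range via the file rmap (i_mmap);
     * no mmap lock is taken anywhere in this path. The final argument
     * (even_cows) also zaps private COW copies, as truncation does. */
    unmap_mapping_range(inode->i_mapping, holebegin, holelen, 1);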

But you're right, I was being imprecise - as you pointed out, it's not
just used for zapping. Maybe the right version of 6 is something like:

    Leaf entries that are not in "none" state can
    be changed while holding any one of [...].

Though I'm not sure if that is overly broad - I think in practice the
changes we make under the rmap locks are something like the following,
though that might be missing some:

 - zapping leaf entries
 - zapping PMD entries pointing to page tables
 - clearing A/D bits
 - migration

> OK so interestingly this really aligns with what Alice said as to this not
> giving a clear indicator from a user's perspective as to 'what lock do I
> need to hold'.
>
> So I will absolutely address all this and try to get the fundamentals
> boiled down.
>
> Also obviously the exception to your rules are - _freeing_ of higher level
> page tables because we assume we are in a state where nothing can access
> them so no such locks are required. But I cover that below.
>
> >
> > And then the rules for readers mostly follow from that:
> > 1 => holding the appropriate page table lock makes the contents of a
> > page table stable, except for A/D updates
> > 2 => page table entries higher than PMD level that point to lower page
> > tables can be followed without taking page table locks
>
> Yeah this is true actually, might be worth mentioning page table walkers
> here and how they operate as they're instructive on page table locking
> requirements.
>
> > 3+4 => following PMD entries pointing to page tables requires careful
> > locking, and pte_offset_map_lock() does that for you
>
> Well, pte_offset_map_lock() is obtained at the PTE level right?

pte_offset_map_lock() is given a pointer to a PMD entry, and it
follows the PMD entry to a PTE-level page table. My point here is that
you can't just simply start a "lock this PTE-level page table"
operation at the PTE level because by the time you've locked the page
table, the PMD entry may have changed, and the page table you just
locked may be empty and doomed to be deleted after RCU delay. So you
have to use __pte_offset_map_lock(), which takes a pointer to a PMD
entry, and in a loop, looks up the page table from the PMD entry,
locks the referenced page table, rechecks that the PMD entry still
points to the locked page table, and if not, retries all these steps
until it manages to lock a stable page table.
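
From the caller's side, what that internal recheck loop buys you is: on
success the returned pointer is into a PTE table that is locked and still
installed in *pmd; if there is no longer a stable PTE table there, you get
NULL back and must cope with it. A sketch of the usage pattern (not of
__pte_offset_map_lock() itself; 'addr'/'end' assumed):

    spinlock_t *ptl;
    pte_t *start_pte, *pte;

    start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
    if (!pte)
            return;                 /* no stable PTE table here (any more) */
    for (; addr < end; pte++, addr += PAGE_SIZE) {
            pte_t entry = ptep_get(pte);
            /* ... act on entry ... */
    }
    pte_unmap_unlock(start_pte, ptl);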

> pmd_lock() at the PMD level (pud_lock() ostensibly at PUD level but this
> amounts to an mm->page_table_lock anyway there)

> > I think something like
> > https://www.kernel.org/doc/html/latest/doc-guide/kernel-doc.html#overview-documentation-comments
> > is supposed to let you include the current version of the comment into
> > the rendered documentation HTML without having to manually keep things
> > in sync. I've never used that myself, but there are a bunch of
> > examples in the tree; for example, grep for "DMA fences overview".
>
> Ah, but this isn't a kernel doc, it's just a raw comment :) so I'm not sure there
> is a great way of grabbing just that, reliably. Maybe can turn that into a
> kernel doc comment in a follow up patch or something?

Ah, yeah, sounds reasonable.
Lorenzo Stoakes Nov. 5, 2024, 12:23 p.m. UTC | #19
On Mon, Nov 04, 2024 at 05:19:21PM +0200, Mike Rapoport wrote:
> On Mon, Nov 04, 2024 at 02:17:36PM +0000, Lorenzo Stoakes wrote:
> > On Sat, Nov 02, 2024 at 11:00:20AM +0200, Mike Rapoport wrote:
> > > > +
> > > > +The table specifies which write locks must be held to write to the field.
> > > > +
> > > > +.. list-table::
> > > > +   :widths: 20 10 22 5 20
> > > > +   :header-rows: 1
> > >
> > > And use .. table here as well, e.g
> >
> > Hm this one is a little less clearly worth it because not only will that take me
> > ages but it'll be quite difficult to read in a sensible editor. I can if you
> > insist though?
>
> With spaces it will look just fine in a text editor and IMHO better than
> list-table, but I don't insist.
>
> > > .. table::
> > >
> > >     ======== ======== ========================== ================== ==========
> > >     Field    Config   Description                Unstable RCU read? Write lock
> > >     ======== ======== ========================== ================== ==========
> > >     vm_start          Inclusive start virtual                       mmap write,
> > >                       address of range VMA                          VMA write
> > >                       describes
> > >
> > >     vm_end            Exclusive end virtual                         mmap write,
> > >                       address of range VMA                          VMA write
> > >                       describes
> > >
> > >     vm_rcu   vma_lock RCU list head, in union    N/A                N/A
> > >                       with vma_start, vma_end.
> > >                       RCU implementation detail
> > >     ======== ======== ========================== ================== ==========
>
> --
> Sincerely yours,
> Mike.

Since it's you Mike I'll do it ;) unless it turns out obviously
awful. Should probably be fine! :)
Alice Ryhl Nov. 5, 2024, 1:56 p.m. UTC | #20
On Mon, Nov 4, 2024 at 5:52 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> +cc Suren, linux-doc who I mistakenly didn't cc in first email!
>
> On Mon, Nov 04, 2024 at 03:47:56PM +0100, Alice Ryhl wrote:
> > On Fri, Nov 1, 2024 at 7:50 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > Locking around VMAs is complicated and confusing. While we have a number of
> > > disparate comments scattered around the place, we seem to be reaching a
> > > level of complexity that justifies a serious effort at clearly documenting
> > > how locks are expected to be interacted with when it comes to interacting
> > > with mm_struct and vm_area_struct objects.
> > >
> > > This is especially pertinent as regards efforts to find sensible
> > > abstractions for these fundamental objects within the kernel rust
> > > abstraction whose compiler strictly requires some means of expressing these
> > > rules (and through this expression can help self-document these
> > > requirements as well as enforce them which is an exciting concept).
> > >
> > > The document limits scope to mmap and VMA locks and those that are
> > > immediately adjacent and relevant to them - so additionally covers page
> > > table locking as this is so very closely tied to VMA operations (and relies
> > > upon us handling these correctly).
> > >
> > > The document tries to cover some of the nastier and more confusing edge
> > > cases and concerns especially around lock ordering and page table teardown.
> > >
> > > The document also provides some VMA lock internals, which are up to date
> > > and inclusive of recent changes to recent sequence number changes.
> > >
> > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >
> > [...]
> >
> > > +Page table locks
> > > +----------------
> > > +
> > > +When allocating a P4D, PUD or PMD and setting the relevant entry in the above
> > > +PGD, P4D or PUD, the `mm->page_table_lock` is acquired to do so. This is
> > > +acquired in `__p4d_alloc()`, `__pud_alloc()` and `__pmd_alloc()` respectively.
> > > +
> > > +.. note::
> > > +   `__pmd_alloc()` actually invokes `pud_lock()` and `pud_lockptr()` in turn,
> > > +   however at the time of writing it ultimately references the
> > > +   `mm->page_table_lock`.
> > > +
> > > +Allocating a PTE will either use the `mm->page_table_lock` or, if
> > > > +`USE_SPLIT_PMD_PTLOCKS` is defined, use a lock embedded in the PMD physical
> > > +page metadata in the form of a `struct ptdesc`, acquired by `pmd_ptdesc()`
> > > +called from `pmd_lock()` and ultimately `__pte_alloc()`.
> > > +
> > > +Finally, modifying the contents of the PTE has special treatment, as this is a
> > > +lock that we must acquire whenever we want stable and exclusive access to
> > > +entries pointing to data pages within a PTE, especially when we wish to modify
> > > +them.
> > > +
> > > +This is performed via `pte_offset_map_lock()` which carefully checks to ensure
> > > +that the PTE hasn't changed from under us, ultimately invoking `pte_lockptr()`
> > > +to obtain a spin lock at PTE granularity contained within the `struct ptdesc`
> > > +associated with the physical PTE page. The lock must be released via
> > > +`pte_unmap_unlock()`.
> > > +
> > > +.. note::
> > > +   There are some variants on this, such as `pte_offset_map_rw_nolock()` when we
> > > +   know we hold the PTE stable but for brevity we do not explore this.
> > > +   See the comment for `__pte_offset_map_lock()` for more details.
> > > +
> > > +When modifying data in ranges we typically only wish to allocate higher page
> > > +tables as necessary, using these locks to avoid races or overwriting anything,
> > > +and set/clear data at the PTE level as required (for instance when page faulting
> > > +or zapping).
> >
> > Speaking as someone who doesn't know the internals at all ... this
> > section doesn't really answer any questions I have about the page
> > table. It looks like this could use an initial section about basic
> > usage, and the detailed information could come after? Concretely, if I
> > wish to call vm_insert_page or zap some pages, what are the locking
> > requirements? What if I'm writing a page fault handler?
>
> Ack totally agree, I think we need this document to serve two purposes -
> one is to go over, in detail, the locking requirements from an mm dev's
> point of view with internals focus, and secondly to give those outside mm
> this kind of information.
>
> It's good to get insight from an outside perspective as inevitably we mm
> devs lose sight of the wood for the trees when it comes to internals
> vs. practical needs of those who make use of mm in one respect or another.
>
> So this kind of feedback is very helpful and welcome :) TL;DR - yes I will
> explicitly state what is required for various operations on the respin.
>
> >
> > Alice
>
> As a wordy aside, a large part of the motivation of this document, or
> certainly my prioritisation of it, is explicitly to help the rust team
> correctly abstract this aspect of mm.
>
> The other part is to help the mm team, that is especially myself, correctly
> understand and _remember_ the numerous painful ins and outs of this stuff,
> much of which has been pertinent of late for not wonderfully positive
> reasons.
>
> Hopefully we accomplish both! :>)

I do think this has revealed one issue with my Rust patch, which is
that VmAreaMut currently requires the mmap lock, but it should also
require the vma lock, since you need both for writing.

Alice
Lorenzo Stoakes Nov. 5, 2024, 2:18 p.m. UTC | #21
On Tue, Nov 05, 2024 at 02:56:43PM +0100, Alice Ryhl wrote:
[snip]
> >
> > As a wordy aside, a large part of the motivation of this document, or
> > certainly my prioritisation of it, is explicitly to help the rust team
> > correctly abstract this aspect of mm.
> >
> > The other part is to help the mm team, that is especially myself, correctly
> > understand and _remember_ the numerous painful ins and outs of this stuff,
> > much of which has been pertinent of late for not wonderfully positive
> > reasons.
> >
> > Hopefully we accomplish both! :>)
>
> I do think this has revealed one issue with my Rust patch, which is
> that VmAreaMut currently requires the mmap lock, but it should also
> require the vma lock, since you need both for writing.
>
> Alice

Awesome :) I am genuinely hoping this doc will aid you guys in rust in
general.

And maybe I can use advent of code this year to actually learn the language
myself...

I nominate Vlasta to join me ;)
Lorenzo Stoakes Nov. 5, 2024, 4:10 p.m. UTC | #22
On Mon, Nov 04, 2024 at 10:29:47PM +0100, Jann Horn wrote:
> On Mon, Nov 4, 2024 at 5:42 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > On Sat, Nov 02, 2024 at 02:45:35AM +0100, Jann Horn wrote:
> > > On Fri, Nov 1, 2024 at 7:50 PM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > > Locking around VMAs is complicated and confusing. While we have a number of
> > > > disparate comments scattered around the place, we seem to be reaching a
> > > > level of complexity that justifies a serious effort at clearly documenting
> > > > how locks are expected to be interacted with when it comes to interacting
> > > > with mm_struct and vm_area_struct objects.
> > > >
> > > > This is especially pertinent as regards efforts to find sensible
> > > > abstractions for these fundamental objects within the kernel rust
> > > > abstraction whose compiler strictly requires some means of expressing these
> > > > rules (and through this expression can help self-document these
> > > > requirements as well as enforce them which is an exciting concept).
> > > >
> > > > The document limits scope to mmap and VMA locks and those that are
> > > > immediately adjacent and relevant to them - so additionally covers page
> > > > table locking as this is so very closely tied to VMA operations (and relies
> > > > upon us handling these correctly).
> > > >
> > > > The document tries to cover some of the nastier and more confusing edge
> > > > cases and concerns especially around lock ordering and page table teardown.
> > > >
> > > > The document also provides some VMA lock internals, which are up to date
> > > > and inclusive of recent changes to recent sequence number changes.
>
> > > > +Overview
> > > > +--------
> > > > +
> > > > +Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
> > > > +'VMA's of type `struct vm_area_struct`.
> > > > +
> > > > +Each VMA describes a virtually contiguous memory range with identical
> > > > +attributes, each of which described by a `struct vm_area_struct`
> > > > +object. Userland access outside of VMAs is invalid except in the case where an
> > > > +adjacent stack VMA could be extended to contain the accessed address.
> > > > +
> > > > +All VMAs are contained within one and only one virtual address space, described
> > > > +by a `struct mm_struct` object which is referenced by all tasks (that is,
> > > > +threads) which share the virtual address space. We refer to this as the `mm`.
> > > > +
> > > > +Each mm object contains a maple tree data structure which describes all VMAs
> > > > +within the virtual address space.
> > >
> > > The gate VMA is special, on architectures that have it: Userland
> > > access to its area is allowed, but the area is outside the VA range
> > > managed by the normal MM code, and the gate VMA is a global object
> > > (not per-MM), and only a few places in MM code can interact with it
> > > (for example, page fault handling can't, but GUP can through
> > > get_gate_page()).
> > >
> > > (I think this also has the fun consequence that vm_normal_page() can
> > > get called on a VMA whose ->vm_mm is NULL, when called from
> > > get_gate_page().)
> >
> > Yeah the gate page is weird, I'm not sure it's worth going into too much detail
> > here, but perhaps a note explaining in effect 'except for the gate page..'
> > unless you think it'd be valuable to go into that in more detail than a passing
> > 'hey of course there's an exception to this!' comment? :)
>
> Yeah I think that's good enough.

Thanks!

>
> > > > +The kernel is designed to be highly scalable against concurrent access to
> > > > +userland memory, so a complicated set of locks are required to ensure no data
> > > > +races or memory corruption occurs.
> > > > +
> > > > +This document explores this locking in detail.
> > > > +
> > > > +.. note::
> > > > +
> > > > +   There are three different things that a user might want to achieve via
> > > > +   locks - the first of which is **stability**. That is - ensuring that the VMA
> > > > +   won't be freed or modified in any way from underneath us.
> > > > +
> > > > +   All MM and VMA locks ensure stability.
> > > > +
> > > > +   Secondly we have locks which allow **reads** but not writes (and which might
> > > > +   be held concurrent with other CPUs who also hold the read lock).
>
> (maybe also note more clearly here that "read" is talking about the
> VMA metadata, so an operation that writes page table entries normally
> counts as "read")

Yeah good point will add a clarification. This aligns with a point Mike raised
about being clear the read/write is not in reference to user data.

>
> > > > +   Finally, we have locks which permit exclusive access to the VMA to allow for
> > > > +   **writes** to the VMA.
> > >
> > > Maybe also mention that there are three major paths you can follow to
> > > reach a VMA? You can come through the mm's VMA tree, you can do an
> > > anon page rmap walk that goes page -> anon_vma -> vma, or you can do a
> > > file rmap walk from the address_space. Which is why just holding the
> > > mmap lock and vma lock in write mode is not enough to permit arbitrary
> > > changes to a VMA struct.
> >
> > I totally agree that adding something about _where_ you can come from is a good
> > idea, will do.
> >
> > However, in terms of the VMA itself, mmap lock and vma lock _are_ sufficient to
> > prevent arbitrary _changes_ to the VMA struct right?
>
> Yes. But the sentence "Finally, we have locks which permit exclusive
> access to the VMA to allow for **writes** to the VMA" kinda sounds as
> if there is a single lock you can take that allows you to write to the
> VMA struct.
>
> > It isn't sufficient to prevent _reading_ of vma metadata fields, nor walking of
> > underlying page tables, so if you're going to do something that changes
> > fundamentals you need to hide it from rmap.
> >
> > Maybe worth going over relevant fields? Or rather adding an additional 'read
> > lock' column?
> >
> > vma->vm_mm ('static' anyway after VMA created)
> > vma->vm_start (change on merge/split)
>
> and on stack expansion :P
> But I guess nowadays that's basically semantically just a special case
> of merge, so no need to explicitly mention it here...

Well actually no... hm I wonder if we should make that call vma_expand()... but
that's another thing :) will clarify this anyway.

>
> > vma->vm_end (change on merge/split)
> > vma->vm_flags (can change)
> > vma->vm_ops ('static' anyway after call_mmap())
> >
> > In any case this is absolutely _crucial_ I agree, will add.
> >
> > >
> > > > +MM and VMA locks
> > > > +----------------
> > > > +
> > > > +There are two key classes of lock utilised when reading and manipulating VMAs -
> > > > +the `mmap_lock` which is a read/write semaphore maintained at the `mm_struct`
> > > > +level of granularity and, if CONFIG_PER_VMA_LOCK is set, a per-VMA lock at the
> > > > +VMA level of granularity.
> > > > +
> > > > +.. note::
> > > > +
> > > > +   Generally speaking, a read/write semaphore is a class of lock which permits
> > > > +   concurrent readers. However a write lock can only be obtained once all
> > > > +   readers have left the critical region (and pending readers made to wait).
> > > > +
> > > > +   This renders read locks on a read/write semaphore concurrent with other
> > > > +   readers and write locks exclusive against all others holding the semaphore.
> > > > +
> > > > +If CONFIG_PER_VMA_LOCK is not set, then things are relatively simple - a write
> > > > +mmap lock gives you exclusive write access to a VMA, and a read lock gives you
> > > > +concurrent read-only access.
> > > > +
> > > > +In the presence of CONFIG_PER_VMA_LOCK, i.e. VMA locks, things are more
> > > > +complicated. In this instance, a write semaphore is no longer enough to gain
> > > > +exclusive access to a VMA, a VMA write lock is also required.
> > > > +
> > > > +The VMA lock is implemented via the use of both a read/write semaphore and
> > > > +per-VMA and per-mm sequence numbers. We go into detail on this in the VMA lock
> > > > +internals section below, so for the time being it is important only to note that
> > > > +we can obtain either a VMA read or write lock.
> > > > +
> > > > +.. note::
> > > > +
> > > > +   VMAs under VMA **read** lock are obtained by the `lock_vma_under_rcu()`
> > > > +   function, and **no** existing mmap or VMA lock must be held. This function
> > >
> > > uffd_move_lock() calls lock_vma_under_rcu() after having already
> > > VMA-locked another VMA with uffd_lock_vma().
> >
> > Oh uffd, how we love you...
> >
> > I think it might be worth adding a note for this exception. Obviously they do
> > some pretty careful manipulation to avoid issues here so probably worth saying
> > 'hey except uffd'
>
> I guess another way to write it would be something like:
>
> "Taking the mmap lock in read mode while you're holding a vma lock is
> forbidden because it can deadlock. Calling lock_vma_under_rcu()
> normally only makes sense when you're not holding the mmap lock
> (otherwise it would be redundant). lock_vma_under_rcu() has trylock
> semantics, and if it fails you need a plan B (which normally is to
> take the mmap lock in read mode instead; notably this would get more
> annoying if you were already holding another VMA lock, because then
> you'd have to drop that first)."?

Will extract some stuff from here and add in.

>
> > > > +   lock, page faults can race against you even if you hold an mmap write lock.
> > > > +
> > > > +VMA Fields
> > > > +----------
> > > > +
> > > > +We examine each field of the `struct vm_area_struct` type in detail in the table
> > > > +below.
> > > > +
> > > > +Reading of each field requires either an mmap read lock or a VMA read lock to be
> > > > +held, except where 'unstable RCU read' is specified, in which case unstable
> > > > +access to the field is permitted under RCU alone.
> > > > +
> > > > +The table specifies which write locks must be held to write to the field.
> > >
> > > vm_start, vm_end and vm_pgoff also require that the associated
> > > address_space and anon_vma (if applicable) are write-locked, and that
> > > their rbtrees are updated as needed.
> >
> > Surely vm_flags too...
>
> Nah, there are a bunch of madvise() operations that change vm_flags,
> and at least the simple ones don't touch rmap locks (I don't know if
> maybe any of the more complex ones do). See MADV_DONTFORK, for example
> - we basically just take the mmap lock in write mode, write-lock the
> VMA, and overwrite the flags.
>
> Not even do_mprotect_pkey() takes rmap locks! Just takes the mmap lock
> in write mode, write-locks the VMA, changes the VM flags, and then
> fixes up all the existing PTEs.

Sure was aware of course about the madvise() cases (which are, thankfully,
always mmap locked), just wondered if there was a need to be careful about flags
in the same way but you're right... :)

>
> > > > +     -
> > > > +     - mmap write, VMA write
> > > > +   * - vm_file
> > > > +     -
> > > > +     - If the VMA is file-backed, points to a `struct file` object describing
> > > > +       the underlying file, if anonymous then `NULL`.
> > > > +     -
> > > > +     - (Static)
> > > > +   * - vm_private_data
> > > > +     -
> > > > +     - A `void *` field for driver-specific metadata.
> > > > +     -
> > > > +     - Driver-mandated.
> > > > +   * - anon_name
> > > > +     - anon name
> > > > +     - A field for storing a `struct anon_vma_name` object providing a name for
> > > > +       anonymous mappings, or `NULL` if none is set or the VMA is file-backed.
> > > > +     -
> > > > +     - mmap write, VMA write
> > > > +   * - swap_readahead_info
> > > > +     - swap
> > > > +     - Metadata used by the swap mechanism to perform readahead.
> > > > +     -
> > > > +     - mmap read
> > > > +   * - vm_region
> > > > +     - nommu
> > > > +     - The containing region for the VMA for architectures which do not
> > > > +       possess an MMU.
> > > > +     - N/A
> > > > +     - N/A
> > > > +   * - vm_policy
> > > > +     - numa
> > > > +     - `mempolicy` object which describes NUMA behaviour of the VMA.
> > > > +     -
> > > > +     - mmap write, VMA write
> > > > +   * - numab_state
> > > > +     - numab
> > > > +     - `vma_numab_state` object which describes the current state of NUMA

> > > > +       balancing in relation to this VMA.
> > > > +     -
> > > > +     - mmap write, VMA write
> > >
> > > I think task_numa_work() is only holding the mmap lock in read mode
> > > when it sets this pointer to a non-NULL value.
> >
> > ugh lord... knew I'd get at least one of these wrong :P
>
> to be fair I think task_numa_work() looks kinda dodgy ^^ I remember
> spending quite a bit of time staring at it at one point (my very
> sparse notes suggest I was looking in that area because I was
> surprised that change_protection() can run with the mmap lock only
> read-locked for some NUMA hinting fault stuff); I don't remember
> whether I concluded that the ->vma_numab_state locking in
> task_numa_work() is fine or just not overly terrible...

Yeah I guess we are documenting _what is_ rather than _what is sane_ anyway :P

>
> > > > +   * - vm_userfaultfd_ctx
> > > > +     -
> > > > +     - Userfaultfd context wrapper object of type `vm_userfaultfd_ctx`, either
> > > > +       of zero size if userfaultfd is disabled, or containing a pointer to an
> > > > +       underlying `userfaultfd_ctx` object which describes userfaultfd metadata.
> > > > +     -
> > > > +     - mmap write, VMA write
> > > > +
> > > > +.. note::
> > > > +
> > > > +   In the config column 'vma lock' configuration means CONFIG_PER_VMA_LOCK,
> > > > +   'anon name' means CONFIG_ANON_VMA_NAME, 'swap' means CONFIG_SWAP, 'nommu'
> > > > +   means that CONFIG_MMU is not set, 'numa' means CONFIG_NUMA and 'numab' means
> > > > +   CONFIG_NUMA_BALANCING'.
> > > > +
> > > > +   In the write lock column '(Static)' means that the field is set only once
> > > > +   upon initialisation of the VMA and not changed after this, the VMA would
> > > > +   either have been under an mmap write and VMA write lock at the time or not
> > > > +   yet inserted into any tree.
> > > > +
> > > > +Page table locks
> > > > +----------------
> > > > +
> > > > +When allocating a P4D, PUD or PMD and setting the relevant entry in the above
> > > > +PGD, P4D or PUD, the `mm->page_table_lock` is acquired to do so. This is
> > > > +acquired in `__p4d_alloc()`, `__pud_alloc()` and `__pmd_alloc()` respectively.
> > > > +
> > > > +.. note::
> > > > +   `__pmd_alloc()` actually invokes `pud_lock()` and `pud_lockptr()` in turn,
> > > > +   however at the time of writing it ultimately references the
> > > > +   `mm->page_table_lock`.
> > > > +
> > > > +Allocating a PTE will either use the `mm->page_table_lock` or, if
> > > > +`USE_SPLIT_PMD_PTLOCKS` is defined, use a lock embedded in the PMD physical
> > > > +page metadata in the form of a `struct ptdesc`, acquired by `pmd_ptdesc()`
> > > > +called from `pmd_lock()` and ultimately `__pte_alloc()`.
> > > > +
> > > > +Finally, modifying the contents of the PTE has special treatment, as this is a
> > > > +lock that we must acquire whenever we want stable and exclusive access to
> > > > +entries pointing to data pages within a PTE, especially when we wish to modify
> > > > +them.
> > >
> > > I guess one other perspective on this would be to focus on the
> > > circumstances under which you're allowed to write entries:
> > >
> > > 0. page tables can be concurrently read by hardware and GUP-fast, so
> > > writes must always be appropriately atomic
> >
> > Yeah I definitely need to mention GUP-fast considerations (and consequently
> > the pXX_lockless..() functions). Thanks for raising that,  important one.
> >
> > > 1. changing a page table entry always requires locking the containing
> > > page table (except when the write is an A/D update by hardware)
> >
> > I think we can ignore the hardware writes altogether, though I think worth
> > adding a 'note' to explain this can happen outside of this framework
> > altogether.
>
> I think it's important to know about the existence of hardware writes
> because it means you need atomic operations when making changes to
> page tables. Like, for example, in many cases when changing a present
> PTE, you can't even use READ_ONCE()/WRITE_ONCE() for PTEs and need
> atomic RMW operations instead - see for example ptep_get_and_clear(),
> which is basically implemented in arch code as an atomic xchg so that
> it can't miss concurrent A/D bit updates.
>
> (The non-SMP version of that on X86 doesn't use atomics, I have no
> idea if that is actually correct or just mostly-working. Either way, I
> guess the !SMP build doesn't matter very much nowadays.)

Ack will document.

>
> > > 2. in page tables higher than PMD level, page table entries that point
> > > to page tables can only be changed to point to something else when
> > > holding all the relevant high-level locks leading to the VMA in
> > > exclusive mode: mmap lock (unless the VMA is detached), VMA lock,
> > > anon_vma, address_space
> >
> > Right this seems mremap()-specific when you say 'change' here :) and of
> > course, we have code that explicitly does this (take_rmap_locks() +
> > drop_rmap_locks()).
>
> munmap and mremap, yes. Basically what I'm trying to express with this
> is "as a reader, you can assume that higher page tables are stable
> just by having some kind of read lock on the VMA or its rmaps".
>
> (IIRC back when this was the rule for all page table levels,
> khugepaged used to do this too, write-locking both the rmap and the
> mm.)

I feel that talking about stability in the page table section is a good
idea also then.

>
> > > 3. PMD entries that point to page tables can be changed while holding
> > > the page table spinlocks for the entry and the table it points to
> >
> > Hm wut? When you say 'entry' what do you mean? Obviously a page table in
>
> By "PMD entry" I mean a pmd_t (located in a Page Middle Directory),
> and by "that point to page tables" I mean "that point to a PTE-level
> page table".
>
> In other words, from the reader perspective (as I think you already
> mentioned further down):
>
> Rule 2 means: From the perspective of a reader who is holding the VMA
> lock in read mode, once you have seen that e.g. a PUD entry
> (overlapping the VMA's virtual address region) points to a PMD page
> table, you know that this PUD entry will keep pointing to that PMD
> table.
>
> Rule 3 means: From the perspective of a reader who is holding the VMA
> lock in read mode, once you have seen that a PMD entry (overlapping
> the VMA's virtual address region) points to a page table, you don't
> know whether this PMD entry will keep pointing to the same page table
> unless you're also holding a spinlock on either the PMD or the page
> table (because khugepaged).

Thanks right I see what you mean.

Might be worth having an explicit THP (thus khugepaged) section? And
perhaps even KSM...

>
> > theory could be changed at any point you don't have it locked and to be
> > sure it hasn't you have to lock + check again.
>
>
> > > 5. entries in "none" state can only be populated with leaf entries
> > > while holding the mmap or vma lock (doing it through the rmap would be
> > > bad because that could race with munmap() zapping data pages in the
> > > region)
> > > 6. leaf entries can be zapped (changed to "none") while holding any
> > > one of mmap lock, vma lock, address_space lock, or anon_vma lock
> >
> > For both 5 and 6 - I'm not sure if we ever zap without holding the mmap
> > lock do we?
> >
> > Unless you're including folio_mkclean() and pfn_mkclean_range()? I guess
> > this is 'strike of the linux kernel terminology' once again :P
> >
> > Yeah in that case sure.
>
> There are a bunch of paths that zap without taking the mmap lock, the
> easiest to reach is probably the ftruncate() syscall:
>
> do_sys_ftruncate -> do_ftruncate -> do_truncate -> notify_change ->
> simple_setattr -> truncate_setsize -> truncate_pagecache ->
> unmap_mapping_range -> unmap_mapping_pages -> unmap_mapping_range_tree
> -> {loop over file rmap tree} -> unmap_mapping_range_vma ->
> zap_page_range_single
>
> GPU drivers and such do it too, search for "unmap_mapping_range".

Yeah sorry I missed unmap_mapping_range(), I don't love the naming around
this...

Might be worth spelling out these paths specifically actually.

>
> But you're right, I was being imprecise - as you pointed out, it's not
> just used for zapping. Maybe the right version of 6 is something like:
>
>     Leaf entries that are not in "none" state can
>     be changed while holding any one of [...].
>
> Though I'm not sure if that is overly broad - I think in practice the
> changes we make under the rmap locks are something like the following,
> though that might be missing some:
>
>  - zapping leaf entries
>  - zapping PMD entries pointing to page tables
>  - clearing A/D bits
>  - migration
>
> > OK so interestingly this really aligns with what Alice said as to this not
> > giving a clear indicator from a user's perspective as to 'what lock do I
> > need to hold'.
> >
> > So I will absolutely address all this and try to get the fundamentals
> > boiled down.
> >
> > Also obviously the exception to your rules are - _freeing_ of higher level
> > page tables because we assume we are in a state where nothing can access
> > them so no such locks are required. But I cover that below.
> >
> > >
> > > And then the rules for readers mostly follow from that:
> > > 1 => holding the appropriate page table lock makes the contents of a
> > > page table stable, except for A/D updates
> > > 2 => page table entries higher than PMD level that point to lower page
> > > tables can be followed without taking page table locks
> >
> > Yeah this is true actually, might be worth mentioning page table walkers
> > here and how they operate as they're instructive on page table locking
> > requirements.
> >
> > > 3+4 => following PMD entries pointing to page tables requires careful
> > > locking, and pte_offset_map_lock() does that for you
> >
> > Well, pte_offset_map_lock() is obtained at the PTE level right?
>
> pte_offset_map_lock() is given a pointer to a PMD entry, and it
> follows the PMD entry to a PTE-level page table. My point here is that
> you can't just simply start a "lock this PTE-level page table"
> operation at the PTE level because by the time you've locked the page
> table, the PMD entry may have changed, and the page table you just
> locked may be empty and doomed to be deleted after RCU delay. So you
> have to use __pte_offset_map_lock(), which takes a pointer to a PMD
> entry, and in a loop, looks up the page table from the PMD entry,
> locks the referenced page table, rechecks that the PMD entry still
> points to the locked page table, and if not, retries all these steps
> until it manages to lock a stable page table.

Right yeah, I mean this is kind of a standard pattern in the kernel though
like:

1. Grab some pointer to something
2. Lock
3. Really make sure it hasn't disappeared from under us
4. If so, unlock and try again
5. Otherwise proceed

You have this pattern with folios too...

But yeah maybe worth spelling this out.
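
For what it's worth, the folio flavour of that pattern looks roughly like the
sketch below. It is deliberately simplified - a real user would re-do whatever
lookup produced 'page' rather than recheck blindly, and needs RCU or similar to
make touching the struct page safe at all:

    struct folio *folio;

retry:
    folio = page_folio(page);
    if (!folio_try_get(folio))              /* 1. grab (a reference) */
            goto retry;
    folio_lock(folio);                      /* 2. lock */
    if (unlikely(page_folio(page) != folio)) {
            folio_unlock(folio);            /* 3. it changed under us */
            folio_put(folio);
            goto retry;                     /* 4. try again */
    }
    /* 5. proceed: folio is locked and still backs 'page' */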

>
> > pmd_lock() at the PMD level (pud_lock() ostensibly at PUD level but this
> > amounts to an mm->page_table_lock anyway there)
>
> > > I think something like
> > > https://www.kernel.org/doc/html/latest/doc-guide/kernel-doc.html#overview-documentation-comments
> > > is supposed to let you include the current version of the comment into
> > > the rendered documentation HTML without having to manually keep things
> > > in sync. I've never used that myself, but there are a bunch of
> > > examples in the tree; for example, grep for "DMA fences overview".
> >
> > Ah, but this isn't a kernel doc, it's just a raw comment :) so I'm not sure there
> > is a great way of grabbing just that, reliably. Maybe can turn that into a
> > kernel doc comment in a follow up patch or something?
>
> Ah, yeah, sounds reasonable.

Thanks.


All this makes me think that we should actually have entirely separate
top-level description and internals sections in this document, which
aligns again with Alice's comments.

As the level of detail and caveats here mean that if you provide
implementation details everywhere you end up constantly on a tangent
(important, relevant internal details but to a _user_ of the functionality
not so important).
Lorenzo Stoakes Nov. 5, 2024, 4:11 p.m. UTC | #23
On Mon, Nov 04, 2024 at 01:20:14PM -0800, Suren Baghdasaryan wrote:
[snip]
> Sounds good! I think it would be good to separate design decisions (mm
> should be write-locked before any VMA can be write-locked, all
> write-locked VMAs and automatically unlocked once mm is
> write-unlocked, etc) vs implementation details (lock_vma_under_rcu(),
> vma->vm_lock_seq, mm->mm_lock_seq, etc). Easily said than done :)

Yeah, was saying the same to Jann - probably best to have an explicit separate
section for implementation details with the rest providing readable + pertinent
details for _users_ of these interfaces.
Jann Horn Nov. 5, 2024, 5:21 p.m. UTC | #24
On Tue, Nov 5, 2024 at 5:10 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> On Mon, Nov 04, 2024 at 10:29:47PM +0100, Jann Horn wrote:
> > On Mon, Nov 4, 2024 at 5:42 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > > On Sat, Nov 02, 2024 at 02:45:35AM +0100, Jann Horn wrote:
> > > > On Fri, Nov 1, 2024 at 7:50 PM Lorenzo Stoakes
> > > > <lorenzo.stoakes@oracle.com> wrote:
> > > > 3. PMD entries that point to page tables can be changed while holding
> > > > the page table spinlocks for the entry and the table it points to
> > >
> > > Hm wut? When you say 'entry' what do you mean? Obviously a page table in
> >
> > By "PMD entry" I mean a pmd_t (located in a Page Middle Directory),
> > and by "that point to page tables" I mean "that point to a PTE-level
> > page table".
> >
> > In other words, from the reader perspective (as I think you already
> > mentioned further down):
> >
> > Rule 2 means: From the perspective of a reader who is holding the VMA
> > lock in read mode, once you have seen that e.g. a PUD entry
> > (overlapping the VMA's virtual address region) points to a PMD page
> > table, you know that this PUD entry will keep pointing to that PMD
> > table.
> >
> > Rule 3 means: From the perspective of a reader who is holding the VMA
> > lock in read mode, once you have seen that a PMD entry (overlapping
> > the VMA's virtual address region) points to a page table, you don't
> > know whether this PMD entry will keep pointing to the same page table
> > unless you're also holding a spinlock on either the PMD or the page
> > table (because khugepaged).
>
> Thanks right I see what you mean.
>
> Might be worth having an explicit THP (thus khugepaged) section? And
> perhaps even KSM...

Maybe, yeah - I think it's important to roughly know what they do, but
I would still focus on what rules other parts of MM, or users of MM,
have to follow to not break in their interaction with things like THP
and KSM. So maybe kinda do it in the direction of "these are the rules
(and here is the detail of why we have those arbitrary-looking
rules)"?

> > But you're right, I was being imprecise - as you pointed out, it's not
> > just used for zapping. Maybe the right version of 6 is something like:
> >
> >     Leaf entries that are not in "none" state can
> >     be changed while holding any one of [...].
> >
> > Though I'm not sure if that is overly broad - I think in practice the
> > changes we make under the rmap locks are something like the following,
> > though that might be missing some:
> >
> >  - zapping leaf entries
> >  - zapping PMD entries pointing to page tables
> >  - clearing A/D bits
> >  - migration
> >
> > > OK so interestingly this really aligns with what Alice said as to this not
> > > giving a clear indicator from a user's perspective as to 'what lock do I
> > > need to hold'.
> > >
> > > So I will absolutely address all this and try to get the fundamentals
> > > boiled down.
> > >
> > > Also obviously the exception to your rules are - _freeing_ of higher level
> > > page tables because we assume we are in a state where nothing can access
> > > them so no such locks are required. But I cover that below.
> > >
> > > >
> > > > And then the rules for readers mostly follow from that:
> > > > 1 => holding the appropriate page table lock makes the contents of a
> > > > page table stable, except for A/D updates
> > > > 2 => page table entries higher than PMD level that point to lower page
> > > > tables can be followed without taking page table locks
> > >
> > > Yeah this is true actually, might be worth mentioning page table walkers
> > > here and how they operate as they're instructive on page table locking
> > > requirements.
> > >
> > > > 3+4 => following PMD entries pointing to page tables requires careful
> > > > locking, and pte_offset_map_lock() does that for you
> > >
> > > Well, pte_offset_map_lock() is obtained at the PTE level right?
> >
> > pte_offset_map_lock() is given a pointer to a PMD entry, and it
> > follows the PMD entry to a PTE-level page table. My point here is that
> > you can't just simply start a "lock this PTE-level page table"
> > operation at the PTE level because by the time you've locked the page
> > table, the PMD entry may have changed, and the page table you just
> > locked may be empty and doomed to be deleted after RCU delay. So you
> > have to use __pte_offset_map_lock(), which takes a pointer to a PMD
> > entry, and in a loop, looks up the page table from the PMD entry,
> > locks the referenced page table, rechecks that the PMD entry still
> > points to the locked page table, and if not, retries all these steps
> > until it manages to lock a stable page table.
>
> Right yeah, I mean this is kind of a standard pattern in the kernel though
> like:
>
> 1. Grab some pointer to something
> 2. Lock
> 3. Really make sure it hasn't disappeared from under us
> 4. If so, unlock and try again
> 5. Otherwise proceed
>
> You have this pattern with folios too...

Yeah, I agree the pattern you need for the access is not that weird,
it's just weird that you need it for page tables at one specific
level.

> > > pmd_lock() at the PMD level (pud_lock() ostensibly at PUD level but this
> > > amounts to an mm->page_table_lock anyway there)
> >
> > > > I think something like
> > > > https://www.kernel.org/doc/html/latest/doc-guide/kernel-doc.html#overview-documentation-comments
> > > > is supposed to let you include the current version of the comment into
> > > > the rendered documentation HTML without having to manually keep things
> > > > in sync. I've never used that myself, but there are a bunch of
> > > > examples in the tree; for example, grep for "DMA fences overview".
> > >
> > > Ah, but this isn't a kernel doc is just a raw comment :) so I'm not sure there
> > > is a great way of grabbing just that, reliably. Maybe can turn that into a
> > > kernel doc comment in a follow up patch or something?
> >
> > Ah, yeah, sounds reasonable.
>
> Thanks.
>
>
> I think all this makes me think that we should actually have entirely
> separate top level descriptions and internals sections in this document,
> which align's again with Alice's comments.
>
> As the level of detail and caveats here mean that if you provide
> implementation details everywhere you end up constantly on a tangent
> (important, relevant internal details but to a _user_ of the functionality
> not so important).

Hmm, yeah.
Qi Zheng Nov. 6, 2024, 2:56 a.m. UTC | #25
On 2024/11/5 00:42, Lorenzo Stoakes wrote:
> On Sat, Nov 02, 2024 at 02:45:35AM +0100, Jann Horn wrote:
>> On Fri, Nov 1, 2024 at 7:50 PM Lorenzo Stoakes

[...]

>>> +
>>> +Page table locks
>>> +----------------

Many thanks to Lorenzo for documenting page table locks! This is really
needed. And at a glance, I agree with Jann's additions.

>>
>> (except last-level page tables: khugepaged already deletes those for
>> file mappings without using the mmap lock at all in
>> retract_page_tables(), and there is a pending series that will do the
>> same with page tables in other VMAs too, see
>> <https://lore.kernel.org/all/cover.1729157502.git.zhengqi.arch@bytedance.com/>)

Thanks to Jann for mentioning this series, I just updated it to v2
recently:

https://lore.kernel.org/lkml/cover.1730360798.git.zhengqi.arch@bytedance.com/

> 
> Ugh wut OK haha. Will look into this.

Thanks!
Qi Zheng Nov. 6, 2024, 3:09 a.m. UTC | #26
Hi Jann,

On 2024/11/5 05:29, Jann Horn wrote:
> On Mon, Nov 4, 2024 at 5:42 PM Lorenzo Stoakes

[...]

> 
> I think it's important to know about the existence of hardware writes
> because it means you need atomic operations when making changes to
> page tables. Like, for example, in many cases when changing a present
> PTE, you can't even use READ_ONCE()/WRITE_ONCE() for PTEs and need
> atomic RMW operations instead - see for example ptep_get_and_clear(),
> which is basically implemented in arch code as an atomic xchg so that
> it can't miss concurrent A/D bit updates.
> 

Totally agree! But I noticed before that ptep_clear() doesn't seem
to need atomic operations because it doesn't need to care about the
A/D bit.

I once looked at the history of how the ptep_clear() was introduced.
If you are interested, you can take a look at my local draft below.
Maybe I missed something.

```
mm: pgtable: make ptep_clear() non-atomic

     In the generic ptep_get_and_clear() implementation, it is just a simple
     combination of ptep_get() and pte_clear(). But for some architectures
    (such as x86 and arm64, etc), the hardware will modify the A/D bits of the
    page table entry, so the ptep_get_and_clear() needs to be overwritten
     and implemented as an atomic operation to avoid contention, which has a
     performance cost.

     The commit d283d422c6c4 ("x86: mm: add x86_64 support for page table
     check") adds the ptep_clear() on the x86, and makes it call
     ptep_get_and_clear() when CONFIG_PAGE_TABLE_CHECK is enabled. The page
     table check feature does not actually care about the A/D bits, so only
    ptep_get() + pte_clear() should be called. But considering that the page
    table check is a debug option, this should not have much of an impact.

     But then the commit de8c8e52836d ("mm: page_table_check: add hooks to
     public helpers") changed ptep_clear() to unconditionally call
    ptep_get_and_clear(), so that the CONFIG_PAGE_TABLE_CHECK check can be
    put into the page table check stubs (in include/linux/page_table_check.h).
    This also causes performance loss to the kernel without
    CONFIG_PAGE_TABLE_CHECK enabled, which doesn't make sense.

    To fix it, just call ptep_get() and pte_clear() in ptep_clear().

     Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 117b807e3f894..2ace92293f5f5 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -506,7 +506,10 @@ static inline void clear_young_dirty_ptes(struct vm_area_struct *vma,
  static inline void ptep_clear(struct mm_struct *mm, unsigned long addr,
                               pte_t *ptep)
  {
-       ptep_get_and_clear(mm, addr, ptep);
+       pte_t pte = ptep_get(ptep);
+
+       pte_clear(mm, addr, ptep);
+       page_table_check_pte_clear(mm, pte);
  }

```

Thanks!
Lorenzo Stoakes Nov. 6, 2024, 11:28 a.m. UTC | #27
On Wed, Nov 06, 2024 at 10:56:29AM +0800, Qi Zheng wrote:
>
>
> On 2024/11/5 00:42, Lorenzo Stoakes wrote:
> > On Sat, Nov 02, 2024 at 02:45:35AM +0100, Jann Horn wrote:
> > > On Fri, Nov 1, 2024 at 7:50 PM Lorenzo Stoakes
>
> [...]
>
> > > > +
> > > > +Page table locks
> > > > +----------------
>
> Many thanks to Lorenzo for documenting page table locks! This is really
> needed. And at a glance, I agree with Jann's additions.

Thanks!

Will be respinning with all comments taken into account relatively soon.

>
> > >
> > > (except last-level page tables: khugepaged already deletes those for
> > > file mappings without using the mmap lock at all in
> > > retract_page_tables(), and there is a pending series that will do the
> > > same with page tables in other VMAs too, see
> > > <https://lore.kernel.org/all/cover.1729157502.git.zhengqi.arch@bytedance.com/>)
>
> Thanks to Jann for mentioning this series, I just updated it to v2
> recently:
>
> https://lore.kernel.org/lkml/cover.1730360798.git.zhengqi.arch@bytedance.com/

Yeah I need to read through a little bit as I was unaware of these paths (mm is
a big and sprawling subsystem more so than one might expect... :)

Could you cc- me on any respin, as at least an interested observer? Thanks!

>
> >
> > Ugh wut OK haha. Will look into this.
>
> Thanks!
>

:>)
Jann Horn Nov. 6, 2024, 6:09 p.m. UTC | #28
On Wed, Nov 6, 2024 at 4:09 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> On 2024/11/5 05:29, Jann Horn wrote:
> > On Mon, Nov 4, 2024 at 5:42 PM Lorenzo Stoakes
>
> [...]
>
> >
> > I think it's important to know about the existence of hardware writes
> > because it means you need atomic operations when making changes to
> > page tables. Like, for example, in many cases when changing a present
> > PTE, you can't even use READ_ONCE()/WRITE_ONCE() for PTEs and need
> > atomic RMW operations instead - see for example ptep_get_and_clear(),
> > which is basically implemented in arch code as an atomic xchg so that
> > it can't miss concurrent A/D bit updates.
> >
>
> Totally agree! But I noticed before that ptep_clear() doesn't seem
> to need atomic operations because it doesn't need to care about the
> A/D bit.
>
> I once looked at the history of how the ptep_clear() was introduced.
> If you are interested, you can take a look at my local draft below.
> Maybe I missed something.
>
> ```
> mm: pgtable: make ptep_clear() non-atomic
>
>      In the generic ptep_get_and_clear() implementation, it is just a simple
>      combination of ptep_get() and pte_clear(). But for some architectures
>      (such as x86 and arm64, etc), the hardware will modify the A/D bits
> of the
>      page table entry, so the ptep_get_and_clear() needs to be overwritten
>      and implemented as an atomic operation to avoid contention, which has a
>      performance cost.
>
>      The commit d283d422c6c4 ("x86: mm: add x86_64 support for page table
>      check") adds the ptep_clear() on the x86, and makes it call
>      ptep_get_and_clear() when CONFIG_PAGE_TABLE_CHECK is enabled. The page
>      table check feature does not actually care about the A/D bits, so only
>      ptep_get() + pte_clear() should be called. But considering that the
> page
>      table check is a debug option, this should not have much of an impact.
>
>      But then the commit de8c8e52836d ("mm: page_table_check: add hooks to
>      public helpers") changed ptep_clear() to unconditionally call
>      ptep_get_and_clear(), so that the  CONFIG_PAGE_TABLE_CHECK check can be
>      put into the page table check stubs (in
> include/linux/page_table_check.h).
>      This also cause performance loss to the kernel without
>      CONFIG_PAGE_TABLE_CHECK enabled, which doesn't make sense.
>
>      To fix it, just calling ptep_get() and pte_clear() in the ptep_clear().
>
>      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 117b807e3f894..2ace92293f5f5 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -506,7 +506,10 @@ static inline void clear_young_dirty_ptes(struct
> vm_area_struct *vma,
>   static inline void ptep_clear(struct mm_struct *mm, unsigned long addr,
>                                pte_t *ptep)
>   {
> -       ptep_get_and_clear(mm, addr, ptep);
> +       pte_t pte = ptep_get(ptep);
> +
> +       pte_clear(mm, addr, ptep);
> +       page_table_check_pte_clear(mm, pte);
>   }
>
> ```

ptep_clear() is currently only used in debug code and in khugepaged
collapse paths, which are fairly expensive, so I don't think the cost
of an extra atomic RMW op should matter here; but yeah, the change
looks correct to me.
Qi Zheng Nov. 7, 2024, 6:47 a.m. UTC | #29
On 2024/11/6 19:28, Lorenzo Stoakes wrote:
> On Wed, Nov 06, 2024 at 10:56:29AM +0800, Qi Zheng wrote:
>>
>>
>> On 2024/11/5 00:42, Lorenzo Stoakes wrote:
>>> On Sat, Nov 02, 2024 at 02:45:35AM +0100, Jann Horn wrote:
>>>> On Fri, Nov 1, 2024 at 7:50 PM Lorenzo Stoakes
>>
>> [...]
>>
>>>>> +
>>>>> +Page table locks
>>>>> +----------------
>>
>> Many thanks to Lorenzo for documenting page table locks! This is really
>> needed. And at a glance, I agree with Jann's additions.
> 
> Thanks!
> 
> Will be respinning with all comments taken into account relatively soon.
> 
>>
>>>>
>>>> (except last-level page tables: khugepaged already deletes those for
>>>> file mappings without using the mmap lock at all in
>>>> retract_page_tables(), and there is a pending series that will do the
>>>> same with page tables in other VMAs too, see
>>>> <https://lore.kernel.org/all/cover.1729157502.git.zhengqi.arch@bytedance.com/>)
>>
>> Thanks to Jann for mentioning this series, I just updated it to v2
>> recently:
>>
>> https://lore.kernel.org/lkml/cover.1730360798.git.zhengqi.arch@bytedance.com/
> 
> Yeah I need to read through a little bit as I was unaware of these paths (mm is
> a big and sprawling subsystem more so than one might expect... :)
> 
> Could you cc- me on any respin, as at least an interested observer? Thanks!

Sure, will cc- you in the next version. ;)

> 
>>
>>>
>>> Ugh wut OK haha. Will look into this.
>>
>> Thanks!
>>
> 
> :>)
Qi Zheng Nov. 7, 2024, 7:07 a.m. UTC | #30
On 2024/11/7 02:09, Jann Horn wrote:
> On Wed, Nov 6, 2024 at 4:09 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>> On 2024/11/5 05:29, Jann Horn wrote:
>>> On Mon, Nov 4, 2024 at 5:42 PM Lorenzo Stoakes
>>
>> [...]
>>
>>>
>>> I think it's important to know about the existence of hardware writes
>>> because it means you need atomic operations when making changes to
>>> page tables. Like, for example, in many cases when changing a present
>>> PTE, you can't even use READ_ONCE()/WRITE_ONCE() for PTEs and need
>>> atomic RMW operations instead - see for example ptep_get_and_clear(),
>>> which is basically implemented in arch code as an atomic xchg so that
>>> it can't miss concurrent A/D bit updates.
>>>
>>
>> Totally agree! But I noticed before that ptep_clear() doesn't seem
>> to need atomic operations because it doesn't need to care about the
>> A/D bit.
>>
>> I once looked at the history of how the ptep_clear() was introduced.
>> If you are interested, you can take a look at my local draft below.
>> Maybe I missed something.
>>
>> ```
>> mm: pgtable: make ptep_clear() non-atomic
>>
>>       In the generic ptep_get_and_clear() implementation, it is just a simple
>>       combination of ptep_get() and pte_clear(). But for some architectures
>>       (such as x86 and arm64, etc), the hardware will modify the A/D bits
>> of the
>>       page table entry, so the ptep_get_and_clear() needs to be overwritten
>>       and implemented as an atomic operation to avoid contention, which has a
>>       performance cost.
>>
>>       The commit d283d422c6c4 ("x86: mm: add x86_64 support for page table
>>       check") adds the ptep_clear() on the x86, and makes it call
>>       ptep_get_and_clear() when CONFIG_PAGE_TABLE_CHECK is enabled. The page
>>       table check feature does not actually care about the A/D bits, so only
>>       ptep_get() + pte_clear() should be called. But considering that the
>> page
>>       table check is a debug option, this should not have much of an impact.
>>
>>       But then the commit de8c8e52836d ("mm: page_table_check: add hooks to
>>       public helpers") changed ptep_clear() to unconditionally call
>>       ptep_get_and_clear(), so that the  CONFIG_PAGE_TABLE_CHECK check can be
>>       put into the page table check stubs (in
>> include/linux/page_table_check.h).
>>       This also cause performance loss to the kernel without
>>       CONFIG_PAGE_TABLE_CHECK enabled, which doesn't make sense.
>>
>>       To fix it, just calling ptep_get() and pte_clear() in the ptep_clear().
>>
>>       Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 117b807e3f894..2ace92293f5f5 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -506,7 +506,10 @@ static inline void clear_young_dirty_ptes(struct
>> vm_area_struct *vma,
>>    static inline void ptep_clear(struct mm_struct *mm, unsigned long addr,
>>                                 pte_t *ptep)
>>    {
>> -       ptep_get_and_clear(mm, addr, ptep);
>> +       pte_t pte = ptep_get(ptep);
>> +
>> +       pte_clear(mm, addr, ptep);
>> +       page_table_check_pte_clear(mm, pte);
>>    }
>>
>> ```
> 
> ptep_clear() is currently only used in debug code and in khugepaged
> collapse paths, which are fairly expensive, so I don't think the cost
> of an extra atomic RMW op should matter here; but yeah, the change
> looks correct to me.

Thanks for double-checking it! And I agree that an extra atomic RMW op
is not a problem in the current call path. But this may be used for
other paths in the future. After all, for the present pte entry, we
need to call ptep_clear() instead of pte_clear() to ensure that
PAGE_TABLE_CHECK works properly.

Maybe this is worth sending a formal patch. ;)

Thanks!
Hillf Danton Nov. 7, 2024, 11:07 a.m. UTC | #31
On Fri,  1 Nov 2024 18:50:33 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> 
> Locking around VMAs is complicated and confusing. While we have a number of
> disparate comments scattered around the place, we seem to be reaching a
> level of complexity that justifies a serious effort at clearly documenting
> how locks are expected to be interacted with when it comes to interacting
> with mm_struct and vm_area_struct objects.
> 
> This is especially pertinent as regards efforts to find sensible
> abstractions for these fundamental objects within the kernel rust
> abstraction whose compiler strictly requires some means of expressing these
> rules (and through this expression can help self-document these
> requirements as well as enforce them which is an exciting concept).
> 
> The document limits scope to mmap and VMA locks and those that are
> immediately adjacent and relevant to them - so additionally covers page
> table locking as this is so very closely tied to VMA operations (and relies
> upon us handling these correctly).
> 
> The document tries to cover some of the nastier and more confusing edge
> cases and concerns especially around lock ordering and page table teardown.
>
What is missing is a clear guide to the correct locking order.
Is the order below correct, for instance?

	lock  vma
	lock  vma->vm_mm
Lorenzo Stoakes Nov. 7, 2024, 11:15 a.m. UTC | #32
On Thu, Nov 07, 2024 at 07:07:17PM +0800, Hillf Danton wrote:
> On Fri,  1 Nov 2024 18:50:33 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >
> > Locking around VMAs is complicated and confusing. While we have a number of
> > disparate comments scattered around the place, we seem to be reaching a
> > level of complexity that justifies a serious effort at clearly documenting
> > how locks are expected to be interacted with when it comes to interacting
> > with mm_struct and vm_area_struct objects.
> >
> > This is especially pertinent as regards efforts to find sensible
> > abstractions for these fundamental objects within the kernel rust
> > abstraction whose compiler strictly requires some means of expressing these
> > rules (and through this expression can help self-document these
> > requirements as well as enforce them which is an exciting concept).
> >
> > The document limits scope to mmap and VMA locks and those that are
> > immediately adjacent and relevant to them - so additionally covers page
> > table locking as this is so very closely tied to VMA operations (and relies
> > upon us handling these correctly).
> >
> > The document tries to cover some of the nastier and more confusing edge
> > cases and concerns especially around lock ordering and page table teardown.
> >
> What is missed is the clear guide to the correct locking order.
> Is the order below correct for instance?
>
> 	lock  vma
> 	lock  vma->vm_mm

There's a whole section on lock ordering (albeit, a copy/paste of mm/rmap.c
comment).

However I do agree that I didn't put enough emphasis on lock ordering for VMA
locks.

I'm working on v2 now (aside: my god you won't believe how surprisingly
challenging it is to write docs, I mean you'd think I'd know after the book but
I forgot I guess :) where I put a very strong emphasis on this locking order,
including reflecting Suren and Jann's input on read mmap lock vs. vma read lock
(you'd probably not want to bother with a vma read lock if you have an mmap read
lock, but if you do take both, the mmap read lock has to be taken before the vma
read lock - the other way round is a deadlock).

The v2 respin puts a much stronger emphasis on separating a top-level practical
guide to what locks to acquire and where in what order vs. implementation
details as per the valuable feedback on this from Alice + others.

So TL;DR - yes agree absolutely and this is made clearer in v2!
diff mbox series

Patch

diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
index 0be1c7503a01..da5f30acaca5 100644
--- a/Documentation/mm/index.rst
+++ b/Documentation/mm/index.rst
@@ -64,3 +64,4 @@  documentation, or deleted if it has served its purpose.
    vmemmap_dedup
    z3fold
    zsmalloc
+   vma_locks
diff --git a/Documentation/mm/vma_locks.rst b/Documentation/mm/vma_locks.rst
new file mode 100644
index 000000000000..52b9d484376a
--- /dev/null
+++ b/Documentation/mm/vma_locks.rst
@@ -0,0 +1,527 @@ 
+VMA Locking
+===========
+
+Overview
+--------
+
+Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
+'VMA's of type `struct vm_area_struct`.
+
+Each VMA describes a virtually contiguous memory range with identical
+attributes. Userland access outside of VMAs is invalid except in the case where
+an adjacent stack VMA could be extended to contain the accessed address.
+
+All VMAs are contained within one and only one virtual address space, described
+by a `struct mm_struct` object which is referenced by all tasks (that is,
+threads) which share the virtual address space. We refer to this as the `mm`.
+
+Each mm object contains a maple tree data structure which describes all VMAs
+within the virtual address space.
+
+The kernel is designed to be highly scalable against concurrent access to
+userland memory, so a complicated set of locks are required to ensure no data
+races or memory corruption occurs.
+
+This document explores this locking in detail.
+
+.. note::
+
+   There are three different things that a user might want to achieve via
+   locks - the first of which is **stability**. That is - ensuring that the VMA
+   won't be freed or modified in any way from underneath us.
+
+   All MM and VMA locks ensure stability.
+
+   Secondly, we have locks which allow **reads** but not writes (and which
+   might be held concurrently with other CPUs that also hold the read lock).
+
+   Finally, we have locks which permit exclusive access to the VMA to allow for
+   **writes** to the VMA.
+
+MM and VMA locks
+----------------
+
+There are two key classes of lock utilised when reading and manipulating VMAs -
+the `mmap_lock` which is a read/write semaphore maintained at the `mm_struct`
+level of granularity and, if CONFIG_PER_VMA_LOCK is set, a per-VMA lock at the
+VMA level of granularity.
+
+.. note::
+
+   Generally speaking, a read/write semaphore is a class of lock which permits
+   concurrent readers. However, a write lock can only be obtained once all
+   readers have left the critical region (and pending readers made to wait).
+
+   This renders read locks on a read/write semaphore concurrent with other
+   readers and write locks exclusive against all others holding the semaphore.
+
+If CONFIG_PER_VMA_LOCK is not set, then things are relatively simple - a write
+mmap lock gives you exclusive write access to a VMA, and a read lock gives you
+concurrent read-only access.
+
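+For example, a minimal sketch of this simple case (illustrative only - `mm` and
+`addr` are assumed to be a valid mm_struct pointer and userland address
+respectively, and error handling is omitted):
+
+.. code-block:: c
+
+  struct vm_area_struct *vma;
+  VMA_ITERATOR(vmi, mm, 0);
+
+  /* Concurrent, read-only traversal of the entire address space. */
+  mmap_read_lock(mm);
+  for_each_vma(vmi, vma) {
+          /* ... inspect, but do not modify, the VMA ... */
+  }
+  mmap_read_unlock(mm);
+
+  /* Exclusive modification of a single VMA. */
+  mmap_write_lock(mm);
+  vma = vma_lookup(mm, addr);
+  if (vma) {
+          /* ... modify the VMA ... */
+  }
+  mmap_write_unlock(mm);
+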
+In the presence of CONFIG_PER_VMA_LOCK, i.e. VMA locks, things are more
+complicated. In this instance, an mmap write lock is no longer enough to gain
+exclusive access to a VMA - a VMA write lock is also required.
+
+The VMA lock is implemented via the use of both a read/write semaphore and
+per-VMA and per-mm sequence numbers. We go into detail on this in the VMA lock
+internals section below, so for the time being it is important only to note that
+we can obtain either a VMA read or write lock.
+
+.. note::
+
+   VMAs under VMA **read** lock are obtained by the `lock_vma_under_rcu()`
+   function, and **no** existing mmap or VMA lock must be held. This function
+   either returns a read-locked VMA, or NULL if the lock could not be
+   acquired. As the name suggests, the VMA will be acquired under RCU, though
+   once obtained, remains stable.
+
+   This kind of locking is entirely optimistic - if the lock is contended or a
+   competing write has started, then we do not obtain a read lock.
+
+   The `lock_vma_under_rcu()` function first calls `rcu_read_lock()` to ensure
+   that the VMA is acquired in an RCU critical section, then attempts to VMA
+   lock it via `vma_start_read()`, before releasing the RCU lock via
+   `rcu_read_unlock()`.
+
+   VMA read locks hold a read lock on the `vma->vm_lock` semaphore for their
+   duration and the caller of `lock_vma_under_rcu()` must release it via
+   `vma_end_read()`.
+
+   VMA **write** locks are acquired via `vma_start_write()` in instances where a
+   VMA is about to be modified. Unlike `vma_start_read()`, the lock is always
+   acquired. An mmap write lock **must** be held for the duration of the VMA
+   write lock. Releasing or downgrading the mmap write lock also releases the
+   VMA write lock, so there is no `vma_end_write()` function.
+
+   Note that the write lock on the `vma->vm_lock` semaphore is not held for the
+   duration of the VMA write lock. Rather, a sequence number is used for
+   serialisation, and the semaphore's write lock is only acquired briefly at the
+   point of write-locking in order to update this sequence number (we explore
+   this in detail in the VMA lock internals section below).
+
+   This ensures the semantics we require - VMA write locks provide exclusive
+   write access to the VMA.
+
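+As a rough illustration of the optimistic read lock and its fallback (a sketch
+only, loosely modelled on what a page fault handler does rather than copied from
+one):
+
+.. code-block:: c
+
+  struct vm_area_struct *vma;
+
+  /* Optimistically try to obtain a read-locked VMA without any mmap lock. */
+  vma = lock_vma_under_rcu(mm, addr);
+  if (vma) {
+          /* ... read VMA fields, handle the fault, etc. ... */
+          vma_end_read(vma);
+          return;
+  }
+
+  /* Contended, or a write was in progress - fall back to the mmap lock. */
+  mmap_read_lock(mm);
+  vma = vma_lookup(mm, addr);
+  if (vma) {
+          /* ... as above, but now stabilised by the mmap read lock ... */
+  }
+  mmap_read_unlock(mm);
+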
+Examining all valid lock states and what each implies:
+
+.. list-table::
+   :header-rows: 1
+
+   * - mmap lock
+     - VMA lock
+     - Stable?
+     - Can read safely?
+     - Can write safely?
+   * - \-
+     - \-
+     - N
+     - N
+     - N
+   * - R
+     - \-
+     - Y
+     - Y
+     - N
+   * - \-
+     - R
+     - Y
+     - Y
+     - N
+   * - W
+     - \-
+     - Y
+     - Y
+     - N
+   * - W
+     - W
+     - Y
+     - Y
+     - Y
+
+Note that there are some exceptions to this - the `anon_vma` field is permitted
+to be written to under mmap read lock and is instead serialised by the `struct
+mm_struct` field `page_table_lock`. In addition, the `vm_mm` and all
+lock-specific fields are permitted to be read under RCU alone (though stability
+cannot be expected in this instance).
+
+.. note::
+   The most notable user of the VMA read lock is the page fault handler on the
+   x86-64 architecture, which importantly means that without a VMA write lock,
+   page faults can race against you even if you hold an mmap write lock.
+
+VMA Fields
+----------
+
+We examine each field of the `struct vm_area_struct` type in detail in the table
+below.
+
+Reading of each field requires either an mmap read lock or a VMA read lock to be
+held, except where 'unstable RCU read' is specified, in which case unstable
+access to the field is permitted under RCU alone.
+
+The table specifies which write locks must be held to write to the field.
+
+.. list-table::
+   :widths: 20 10 22 5 20
+   :header-rows: 1
+
+   * - Field
+     - Config
+     - Description
+     - Unstable RCU read?
+     - Write Lock
+   * - vm_start
+     -
+     - Inclusive start virtual address of range VMA describes.
+     -
+     - mmap write, VMA write
+   * - vm_end
+     -
+     - Exclusive end virtual address of range VMA describes.
+     -
+     - mmap write, VMA write
+   * - vm_rcu
+     - vma lock
+     - RCU list head, in union with vm_start, vm_end. RCU implementation detail.
+     - N/A
+     - N/A
+   * - vm_mm
+     -
+     - Containing mm_struct.
+     - Y
+     - (Static)
+   * - vm_page_prot
+     -
+     - Architecture-specific page table protection bits determined from VMA
+       flags
+     -
+     - mmap write, VMA write
+   * - vm_flags
+     -
+     - Read-only access to VMA flags describing attributes of VMA, in union with
+       private writable `__vm_flags`.
+     -
+     - N/A
+   * - __vm_flags
+     -
+     - Private, writable access to VMA flags field, updated by vm_flags_*()
+       functions.
+     -
+     - mmap write, VMA write
+   * - detached
+     - vma lock
+     - VMA lock implementation detail - indicates whether the VMA has been
+       detached from the tree.
+     - Y
+     - mmap write, VMA write
+   * - vm_lock_seq
+     - vma lock
+     - VMA lock implementation detail - A sequence number used to serialise the
+       VMA lock, see the VMA lock section below.
+     - Y
+     - mmap write, VMA write
+   * - vm_lock
+     - vma lock
+     - VMA lock implementation detail - A pointer to the VMA lock read/write
+       semaphore.
+     - Y
+     - None required
+   * - shared.rb
+     -
+     - A red/black tree node used, if the mapping is file-backed, to place the
+       VMA in the `struct address_space->i_mmap` red/black interval tree.
+     -
+     - mmap write, VMA write, i_mmap write
+   * - shared.rb_subtree_last
+     -
+     - Metadata used for management of the interval tree if the VMA is
+       file-backed.
+     -
+     - mmap write, VMA write, i_mmap write
+   * - anon_vma_chain
+     -
+     - List of links to forked/CoW'd `anon_vma` objects.
+     -
+     - mmap read or above, anon_vma write lock
+   * - anon_vma
+     -
+     - `anon_vma` object used by anonymous folios mapped exclusively to this VMA.
+     -
+     - mmap read or above, page_table_lock
+   * - vm_ops
+     -
+     - If the VMA is file-backed, then either the driver or file-system provides
+       a `struct vm_operations_struct` object describing callbacks to be invoked
+       on specific VMA lifetime events.
+     -
+     - (Static)
+   * - vm_pgoff
+     -
+     - Describes the page offset into the file, the original page offset within
+       the virtual address space (prior to any `mremap()`), or PFN if a PFN map.
+     -
+     - mmap write, VMA write
+   * - vm_file
+     -
+     - If the VMA is file-backed, points to a `struct file` object describing
+       the underlying file, if anonymous then `NULL`.
+     -
+     - (Static)
+   * - vm_private_data
+     -
+     - A `void *` field for driver-specific metadata.
+     -
+     - Driver-mandated.
+   * - anon_name
+     - anon name
+     - A field for storing a `struct anon_vma_name` object providing a name for
+       anonymous mappings, or `NULL` if none is set or the VMA is file-backed.
+     -
+     - mmap write, VMA write
+   * - swap_readahead_info
+     - swap
+     - Metadata used by the swap mechanism to perform readahead.
+     -
+     - mmap read
+   * - vm_region
+     - nommu
+     - The containing region for the VMA for architectures which do not
+       possess an MMU.
+     - N/A
+     - N/A
+   * - vm_policy
+     - numa
+     - `mempolicy` object which describes NUMA behaviour of the VMA.
+     -
+     - mmap write, VMA write
+   * - numab_state
+     - numab
+     - `vma_numab_state` object which describes the current state of NUMA
+       balancing in relation to this VMA.
+     -
+     - mmap write, VMA write
+   * - vm_userfaultfd_ctx
+     -
+     - Userfaultfd context wrapper object of type `vm_userfaultfd_ctx`, either
+       of zero size if userfaultfd is disabled, or containing a pointer to an
+       underlying `userfaultfd_ctx` object which describes userfaultfd metadata.
+     -
+     - mmap write, VMA write
+
+.. note::
+
+   In the config column 'vma lock' configuration means CONFIG_PER_VMA_LOCK,
+   'anon name' means CONFIG_ANON_VMA_NAME, 'swap' means CONFIG_SWAP, 'nommu'
+   means that CONFIG_MMU is not set, 'numa' means CONFIG_NUMA and 'numab' means
+   CONFIG_NUMA_BALANCING'.
+
+   In the write lock column '(Static)' means that the field is set only once
+   upon initialisation of the VMA and not changed after this, the VMA would
+   either have been under an mmap write and VMA write lock at the time or not
+   yet inserted into any tree.
+
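+To make the write locking rules above concrete, the following sketch modifies a
+field requiring both mmap and VMA write locks (the function is hypothetical and
+exists purely for illustration; error handling is omitted):
+
+.. code-block:: c
+
+  /* Hypothetical helper - not a real kernel function. */
+  static void example_mlock_vma(struct mm_struct *mm, struct vm_area_struct *vma)
+  {
+          /* The mmap write lock must already be held... */
+          mmap_assert_write_locked(mm);
+
+          /* ...before the VMA itself is write-locked. */
+          vma_start_write(vma);
+
+          /* __vm_flags requires mmap write + VMA write locks. */
+          vm_flags_set(vma, VM_LOCKED);
+
+          /*
+           * There is no vma_end_write() - the VMA write lock is released
+           * when the mmap write lock is released or downgraded.
+           */
+  }
+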
+Page table locks
+----------------
+
+When allocating a P4D, PUD or PMD and setting the relevant entry in the parent
+PGD, P4D or PUD, the `mm->page_table_lock` is acquired. This is done in
+`__p4d_alloc()`, `__pud_alloc()` and `__pmd_alloc()` respectively.
+
+.. note::
+   `__pmd_alloc()` actually invokes `pud_lock()` and `pud_lockptr()` in turn,
+   however at the time of writing it ultimately references the
+   `mm->page_table_lock`.
+
+Allocating a PTE will either use the `mm->page_table_lock` or, if
+`USE_SPLIT_PMD_PTLOCKS` is defined, use a lock embedded in the PMD physical
+page metadata in the form of a `struct ptdesc`, acquired by `pmd_ptdesc()`
+called from `pmd_lock()` and ultimately `__pte_alloc()`.
+
+Finally, modifying the contents of a PTE-level page table requires special
+treatment - the PTE lock must be acquired whenever we want stable and exclusive
+access to the entries it contains, especially when we wish to modify them.
+
+This is performed via `pte_offset_map_lock()` which carefully checks to ensure
+that the PMD entry (and thus the PTE-level page table it points to) hasn't
+changed from under us, ultimately invoking `pte_lockptr()` to obtain a spin lock
+at PTE granularity contained within the `struct ptdesc` associated with the
+physical PTE page. The lock must be released via `pte_unmap_unlock()`.
+
+.. note::
+   There are some variants on this, such as `pte_offset_map_rw_nolock()` when we
+   know we hold the PTE stable but for brevity we do not explore this.
+   See the comment for `__pte_offset_map_lock()` for more details.
+
+When modifying data in ranges we typically only wish to allocate higher page
+tables as necessary, using these locks to avoid races or overwriting anything,
+and set/clear data at the PTE level as required (for instance when page faulting
+or zapping).
+
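+As an illustration, a PTE range walk under the PTE lock might look something
+like the sketch below (the function name and -EAGAIN convention are invented for
+illustration - real walkers such as those in `mm/pagewalk.c` handle many more
+cases):
+
+.. code-block:: c
+
+  static int example_walk_ptes(struct mm_struct *mm, pmd_t *pmd,
+                               unsigned long addr, unsigned long end)
+  {
+          spinlock_t *ptl;
+          pte_t *start_pte, *pte;
+
+          start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+          if (!start_pte)
+                  return -EAGAIN; /* e.g. the PTE table was concurrently freed. */
+
+          for (pte = start_pte; addr < end; pte++, addr += PAGE_SIZE) {
+                  pte_t ptent = ptep_get(pte);
+
+                  /* ... inspect or modify ptent while holding the PTE lock ... */
+          }
+
+          pte_unmap_unlock(start_pte, ptl);
+          return 0;
+  }
+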
+Page table teardown
+-------------------
+
+Tearing down page tables themselves is something that requires significant
+care. There must be no way that page tables designated for removal can be
+traversed or referenced by concurrent tasks.
+
+It is insufficient to simply hold an mmap write lock and VMA lock (which will
+prevent racing faults, and rmap operations), as a file-backed mapping can be
+truncated under the `struct address_space` i_mmap_lock alone.
+
+As a result, no VMA which can be accessed via the reverse mapping (either
+anon_vma or the `struct address_space->i_mmap` interval tree) can have its page
+tables torn down.
+
+The operation is typically performed via `free_pgtables()`, which assumes either
+the mmap write lock has been taken (as specified by its `mm_wr_locked`
+parameter), or that the VMA is fully detached.
+
+It carefully removes the VMA from all reverse mappings; however, it is important
+that no new reverse mappings overlapping the range are established, and that no
+other route remains by which addresses within the range whose page tables are
+being torn down can be accessed.
+
+As a result of these careful conditions, note that page table entries are
+cleared without page table locks, as it is assumed that all of these precautions
+have already been taken.
+
+mmap write lock downgrading
+---------------------------
+
+While it is possible to obtain an mmap write or read lock using the
+`mm->mmap_lock` read/write semaphore, it is also possible to **downgrade** from
+a write lock to a read lock via `mmap_write_downgrade()`.
+
+Similar to `mmap_write_unlock()`, this implicitly terminates all VMA write locks
+via `vma_end_write_all()` (more on this behaviour in the VMA lock internals
+section below), but importantly does not relinquish the mmap lock while
+downgrading, therefore keeping the locked virtual address space stable.
+
+A subtlety here is that callers which perform an `mmap_write_downgrade()`
+operation can assume that they retain exclusive access to the virtual address
+space (excluding VMA read lock holders), since no other task can acquire the
+mmap write lock - and thus make modifications - until the downgraded read lock
+is itself released.
+
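+For instance, a minimal sketch of the downgrade pattern (error handling
+omitted):
+
+.. code-block:: c
+
+  mmap_write_lock(mm);
+  /* ... modifications requiring exclusive access ... */
+
+  mmap_write_downgrade(mm);  /* Also releases all VMA write locks. */
+
+  /* The address space layout remains stable, but is now only read-locked. */
+
+  mmap_read_unlock(mm);      /* The downgraded lock is released as a read lock. */
+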
+Stack expansion
+---------------
+
+Stack expansion throws up additional complexities in that we cannot permit there
+to be racing page faults. As a result, we invoke `vma_start_write()` to prevent
+this in `expand_downwards()` or `expand_upwards()`.
+
+Lock ordering
+-------------
+
+As we have multiple locks across the kernel which may or may not be taken at the
+same time as explicit mm or VMA locks, we have to be wary of lock inversion, and
+the **order** in which locks are acquired and released becomes very important.
+
+.. note::
+
+   Lock inversion occurs when two threads need to acquire multiple locks,
+   but in doing so inadvertently cause a mutual deadlock.
+
+   For example, consider thread 1 which holds lock A and tries to acquire lock B,
+   while thread 2 holds lock B and tries to acquire lock A.
+
+   Both threads are now deadlocked on each other. However, had they attempted to
+   acquire locks in the same order, one would have waited for the other to
+   complete its work and no deadlock would have occurred.
+
+The opening comment in `mm/rmap.c` describes in detail the required ordering of
+locks within memory management code:
+
+.. code-block::
+
+  inode->i_rwsem	(while writing or truncating, not reading or faulting)
+    mm->mmap_lock
+      mapping->invalidate_lock (in filemap_fault)
+        folio_lock
+          hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
+            vma_start_write
+              mapping->i_mmap_rwsem
+                anon_vma->rwsem
+                  mm->page_table_lock or pte_lock
+                    swap_lock (in swap_duplicate, swap_info_get)
+                      mmlist_lock (in mmput, drain_mmlist and others)
+                      mapping->private_lock (in block_dirty_folio)
+                          i_pages lock (widely used)
+                            lruvec->lru_lock (in folio_lruvec_lock_irq)
+                      inode->i_lock (in set_page_dirty's __mark_inode_dirty)
+                      bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
+                        sb_lock (within inode_lock in fs/fs-writeback.c)
+                        i_pages lock (widely used, in set_page_dirty,
+                                  in arch-dependent flush_dcache_mmap_lock,
+                                  within bdi.wb->list_lock in __sync_single_inode)
+
+Please check the current state of this comment which may have changed since the
+time of writing of this document.
+
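+As a hedged illustration of respecting this ordering when several of these locks
+are required (not taken from any specific function - most code paths need only a
+subset of these locks, and `mapping`/`anon_vma` are assumed to have been looked
+up already):
+
+.. code-block:: c
+
+  mmap_write_lock(mm);                 /* mm->mmap_lock           */
+  vma_start_write(vma);                /* vma_start_write         */
+  i_mmap_lock_write(mapping);          /* mapping->i_mmap_rwsem   */
+  anon_vma_lock_write(anon_vma);       /* anon_vma->rwsem         */
+
+  /* ... operate on the reverse mappings and/or page tables ... */
+
+  anon_vma_unlock_write(anon_vma);
+  i_mmap_unlock_write(mapping);
+  mmap_write_unlock(mm);               /* Also drops the VMA write lock. */
+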
+VMA lock internals
+------------------
+
+The VMA lock mechanism is designed to be a lightweight means of avoiding the use
+of the heavily contended mmap lock. It is implemented using a combination of a
+read/write semaphore and sequence numbers belonging to the containing `struct
+mm_struct` and the VMA.
+
+Read locks are acquired via `vma_start_read()`, which is an optimistic
+operation, i.e. it tries to acquire a read lock but returns false if it is
+unable to do so. At the end of the read operation, `vma_end_read()` is called to
+release the VMA read lock. This can be done under RCU alone.
+
+Writing requires the mmap to be write-locked and the VMA lock to be acquired via
+`vma_start_write()`; however, the write lock is released by the termination or
+downgrade of the mmap write lock so no `vma_end_write()` is required.
+
+All this is achieved by the use of per-mm and per-VMA sequence counts. This is
+used to reduce complexity, particularly around operations which write-lock
+multiple VMAs at once.
+
+If the mm sequence count, `mm->mm_lock_seq`, is equal to the VMA sequence count
+`vma->vm_lock_seq`, then the VMA is write-locked. If they differ, then it is
+not.
+
+Each time an mmap write lock is acquired in `mmap_write_lock()`,
+`mmap_write_lock_nested()` or `mmap_write_lock_killable()`, the `mm->mm_lock_seq`
+sequence number is incremented via `mm_lock_seqcount_begin()`.
+
+Each time the mmap write lock is released in `mmap_write_unlock()` or
+`mmap_write_downgrade()`, `vma_end_write_all()` is invoked which also increments
+`mm->mm_lock_seq` via `mm_lock_seqcount_end()`.
+
+This way, we ensure, regardless of the VMA's sequence number, that a write
+lock is not incorrectly indicated (since we increment the sequence counter on
+acquiring the mmap write lock, which is required in order to obtain a VMA write
+lock), and that when we release an mmap write lock, we efficiently release
+**all** VMA write locks contained within the mmap at the same time.
+
+The exclusivity of the mmap write lock ensures this is what we want, as there
+would never be a reason to persist per-VMA write locks across multiple mmap
+write lock acquisitions.
+
+Each time a VMA read lock is acquired, we acquire a read lock on the
+`vma->vm_lock` read/write semaphore and hold it, while checking that the
+sequence count of the VMA does not match that of the mm.
+
+If it does, the read lock fails. If it does not, we hold the lock, excluding
+writers, but permitting other readers, who will also obtain this lock under RCU.
+
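+Conceptually - and only conceptually, the below is simplified pseudo-code rather
+than the real `vma_start_read()`, with `vma_seq()`/`mm_seq()` standing in for
+reads of `vma->vm_lock_seq` and `mm->mm_lock_seq` and all memory ordering
+elided - the read side looks something like:
+
+.. code-block:: c
+
+  /* Pseudo-code only - not the real implementation. */
+  bool conceptual_vma_start_read(struct vm_area_struct *vma)
+  {
+          /* Equal sequence counts mean the VMA is write-locked - bail out. */
+          if (vma_seq(vma) == mm_seq(vma->vm_mm))
+                  return false;
+
+          if (!down_read_trylock(&vma->vm_lock->lock))
+                  return false;
+
+          /* Re-check - a writer may have raced with us before we locked. */
+          if (vma_seq(vma) == mm_seq(vma->vm_mm)) {
+                  up_read(&vma->vm_lock->lock);
+                  return false;
+          }
+
+          return true;
+  }
+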
+Importantly, maple tree operations performed in `lock_vma_under_rcu()` are also
+RCU safe, so the whole read lock operation is guaranteed to function correctly.
+
+On the write side, we acquire a write lock on the `vma->vm_lock` read/write
+semaphore, before setting the VMA's sequence number under this lock, also
+simultaneously holding the mmap write lock.
+
+This way, if any read locks are in effect, `vma_start_write()` will sleep until
+these are released and mutual exclusion is achieved.
+
+After setting the VMA's sequence number, the lock is released, avoiding
+complexity with a long-term held write lock.
+
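+Again purely conceptually (the real `vma_start_write()` differs in detail, and
+`set_vma_seq()`/`mm_seq()` are placeholders for the sequence number accesses):
+
+.. code-block:: c
+
+  /* Pseudo-code only - not the real implementation. */
+  void conceptual_vma_start_write(struct vm_area_struct *vma)
+  {
+          mmap_assert_write_locked(vma->vm_mm);
+
+          /* Wait for all existing VMA read lock holders to drain. */
+          down_write(&vma->vm_lock->lock);
+
+          /* Mark the VMA write-locked until vma_end_write_all(). */
+          set_vma_seq(vma, mm_seq(vma->vm_mm));
+
+          /* No need to hold the semaphore for the duration of the write. */
+          up_write(&vma->vm_lock->lock);
+  }
+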
+This clever combination of a read/write semaphore and sequence count allows for
+fast RCU-based per-VMA lock acquisition (especially on the x86-64 page fault
+path, though utilised elsewhere) with minimal complexity around lock ordering.