new file mode 100644
@@ -0,0 +1,214 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+===================
+kernfs hashed locks
+===================
+
+kernfs uses following hashed locks
+
+1. Hashed mutexes
+2. Hashed rwsem
+
+In certain cases hashed rwsem needs to work in conjunction with a per-fs mutex
+(Described further below).So this document describes this mutex as well.
+
+A kernfs_global_locks object (defined below) provides hashed mutexes,
+hashed spinlocks and hashed rwsems.
+
+ struct kernfs_global_locks {
+ struct mutex open_file_mutex[NR_KERNFS_LOCKS];
+ struct rw_semaphore kernfs_rwsem[NR_KERNFS_LOCKS];
+ };
+
+For all hashed locks address of a kernfs_node object acts as hashing key.
+
+For the remainder of this document a node means a kernfs_node object. The
+node can refer to a file, directory or symlink of a kernfs based file system.
+Also a node's mutex or rwsem refers to hashed mutex, or hashed rwsem
+corresponding to the node. It does not mean any locking construct embedded in
+the kernfs_node itself.
+
+What is protected by hashed locks
+=================================
+
+(1) There's one kernfs_open_file for each open file and all kernfs_open_file
+ instances corresponding to a kernfs_node are maintained in a list.
+ hashed mutexes or kernfs_global_locks.open_file_mutex[index] protects
+ this list.
+
+(2) Hashed rwsems or kernfs_global_locks.kernfs_rwsem[index] protects node's
+ state and synchronizes operations that change state of a node or depend on
+ the state of a node.
+
+(3) per-fs mutex (mentioned earlier) provides synchronization between lookup
+ and remove operations. It also protects against topology change.
+ While looking for a node we will not have address of corresponding node
+ so we can't acquire node's rwsem right from the beginning.
+ On the other hand a parallel remove operation for the same node can acquire
+ corresponding rwsem and go ahead with node removal. So it may happen that
+ search operation for the node finds and returns it but before it can be
+ pinned or used, the remove operation, that was going on in parallel, removes
+ the node and hence makes its any future use wrong.
+ per-fs mutex ensures that for competing search and remove operations only
+ one proceeds at a time and since object returned by search is pinned before
+ releasing the per-fs mutex, it will be available for subsequent usage.
+
+ This per-fs mutex also protects against topology change during path walks.
+ During path walks we need to acquire and release rwsems corresponding to
+ directories so that these directories don't move and their children RB tree
+ does not change. Since these rwsems can't be taken under a spinlock,
+ kernfs_rename_lock can't be used and needed protection against topology
+ change is provided by per-fs mutex.
+
+Lock usage and proof
+=======================
+
+(1) Hashed mutexes
+
+ Since hashed mutexes protect the list of kernfs_open_file instances
+ corresponding to a kernfs_node, ->open and ->release backends of
+ file_operations need to acquire hashed mutex corresponding to kernfs_node.
+ Also when a kernfs_node is removed, all of its kernfs_open_file instances
+ are drained after deactivating the node. This drain operation acquires
+ hashed mutex to traverse list of kernfs_open_file instances.
+ So addition (via ->open), deletion (via ->release) and traversal
+ (during kernfs_drain) of kernfs_open_file list occurs in a synchronous
+ manner.
+
+(2) Hashed rwsems
+
+ 3.1. A node's rwsem protects its state and needs to be acquired to:
+ 3.1.a. Remove the node
+ 3.1.b. Move the node
+ 3.1.c. Travers or modify a node's children RB tree (for
+ directories), i.e to add/remove files/subdirectories
+ within/from a directory.
+ 3.1.d. Modify or access node's inode attributes
+
+ 3.2. Hashed rwsems are used in following operations:
+
+ 3.2.a. Addition of a new node
+
+ While adding a new kernfs_node under a kernfs directory
+ kernfs_add_one acquires directory node's rwsem for
+ writing. Clause 3.1.a ensures that directory exists
+ throughout the operation. Clause 3.1.c ensures proper
+ updation of children rb tree (i.e ->dir.children).
+ Clause 3.1.d ensures correct modification of inode
+ attribute to reflect timestamp of this operation.
+ If the directory gets removed while waiting for semaphore,
+ the subsequent checks in kernfs_add_one will fail resulting
+ in early bail out from kernfs_add_one.
+
+ 3.2.b. Removal of a node
+
+ Removal of a node involves recursive removal of all of its
+ descendants as well. per-fs mutex (i.e kernfs_rm_mutex) avoids
+ concurrent node removals even if the nodes are different.
+
+ At first node's rwsem is acquired. Clause 3.1.c avoids parallel
+ modification of descendant tree and while holding this rwsem
+ each of the descendants are deactivated.
+
+ Once a descendant has been deactivated and drained, its parent's
+ rwsem is taken. Clause 3.1.c ensures proper unlinking of this
+ descendant from its siblings. Clause 3.1.d ensures that parent's
+ inode attributes are correctly updated to record time stamp of
+ removal.
+
+ 3.2.c. Movement of a node
+
+ Moving or renaming a node (kernfs_rename_ns) acquires rwsem for
+ node and its old and new parents. Clauses 3.1.b and 3.1.c avoid
+ concurrent move operations for the same node.
+ Also if old parent of a node changes while waiting for rwsem,
+ the acquisition of rwsem for 3 involved nodes is attempted
+ again. It is always ensured that as far as old parent is
+ concerned, rwsem corresponding to current parent is acquired.
+
+ 3.2.d. Reading a directory
+
+ For diectory reading kernfs_fop_readdir acquires directory
+ node's rwsem for reading. Clause 3.1.c ensures a consistent view
+ of children RB tree.
+ As far as directroy being read is concerned, if it gets removed
+ while waiting for semaphore, the for loop that iterates through
+ children will be ineffective. So for this operation acquiring
+ directory node's rwsem for reading is enough.
+
+ 3.2.e. Dentry revalidation
+
+ A dentry revalidation (kernfs_dop_revalidate) can happen for a
+ negative or for a normal dentry.
+ For negative dentries we just need to check parent change, so in
+ this case acquiring parent kernfs_node's rwsem for reading is
+ enough.
+ For a normal dentry acquiring node's rwsem for reading is enough
+ (Clause 3.1.a and 3.1.b).
+ If node gets removed while waiting for the lock subsequent checks
+ in kernfs_dop_revalidate will fail and kernfs_dop_revalidate will
+ exit early.
+
+ 3.2.f. kernfs_node lookup
+
+ While searching for a node under a given parent
+ (kernfs_find_and_get_ns, kernfs_walk_and_get_ns) rwsem of parent
+ node is acquired for reading. Clause 3.1.c ensures a consistent
+ view of parent's children RB tree. To avoid parallel removal of
+ found node before it gets pinned, these operation make use of
+ per-fs mutex (kernfs_rm_mutex) as explained earlier.
+ This per-fs mutex is also taken during kernfs_node removal
+ (__kernfs_remove).
+
+ If the node being searched gets removed while waiting for the
+ mutex or rwsem, the subsequent kernfs_find_ns or kernfs_walk_ns
+ will fail.
+
+ 3.2.g. kenfs_node's inode lookup
+
+ Looking up for inode instances via kernfs_iop_lookup involves
+ node lookup. So locks acquired are same as ones required in 3.2.f.
+ Also once node lookup is complete parent's rwsem is released and
+ rwsem of found node is acquired to get corresponding inode.
+ Since we are operating under per-fs kernfs_rm_mutex the found node
+ will not disappear in the middle.
+
+ 3.2.h. Updating or reading inode attribute
+
+ Interfaces that change inode attributes(i.e kernfs_setattr and
+ kernfs_iop_setattr) acquire node's rwsem for writing.
+ If the kernfs_node gets removed while waiting for the semaphore
+ the subsequent __kernfs_setattr will fail.
+ From 3.2.a and 3.2.b we know that updates due to addition or
+ removal of nodes will not happen in parallel.
+ So just locking the kernfs_node in these cases is enough to
+ guarantee correct modification of inode attributes.
+ Similarly the interfaces that read inode attributes
+ (i.e kernfs_iop_getattr, kernfs_iop_permission) just need to
+ acquire involved node's rwsem for reading.
+
+ 3.2.i. kernfs file event generation
+
+ kernfs_notify pins involved node before scheduling
+ kernfs_notify_work and kernfs_notify_workfn acquires node's
+ rwsem. Clauses in 3.1 ensure a consistent view of node state
+ throughout execution of work handler.
+
+ 3.2.j. mount
+
+ kernfs_fill_super, invoked during mount operation, acquires root
+ node's rwsem. During mount process there can't be other execution
+ contexts trying to move or delete the node so just locking the
+ involved node(i.e the root node) is enough.
+
+ 3.2.k. while activating a node
+
+ For a node that started as deactivated, kernfs_activate
+ activates the node. In this case acquiring node's rwsem is
+ enough. Since the node is not active yet any parallel removal
+ that wins the race for rwsem will skip this node and its
+ descendents. Also user space can't see a deactivated node so we
+ don't have any parallel access emanating from their as well.
+
+ 3.3 For operations that involve locking multiple nodes at the same time
+ locks are acquired in order of their addresses.
This document describes usage and proof of various hashed locks introduced in this patch set. Signed-off-by: Imran Khan <imran.f.khan@oracle.com> --- .../filesystems/kernfs-hashed-locks.rst | 214 ++++++++++++++++++ 1 file changed, 214 insertions(+) create mode 100644 Documentation/filesystems/kernfs-hashed-locks.rst