[40/41] mm: separate vma->lock from vm_area_struct

Series

Per-VMA locks

[00/41] Per-VMA locks
[01/41] maple_tree: Be more cautious about dead nodes
[02/41] maple_tree: Detect dead nodes in mas_start()
[03/41] maple_tree: Fix freeing of nodes in rcu mode
[04/41] maple_tree: remove extra smp_wmb() from mas_dead_leaves()
[05/41] maple_tree: Fix write memory barrier of nodes once dead for RCU mode
[06/41] maple_tree: Add smp_rmb() to dead node detection
[07/41] mm: Enable maple tree RCU mode by default.
[08/41] mm: introduce CONFIG_PER_VMA_LOCK
[09/41] mm: rcu safe VMA freeing
[10/41] mm: move mmap_lock assert function definitions
[11/41] mm: export dump_mm()
[12/41] mm: add per-VMA lock and helper functions to control it
[13/41] mm: introduce vma->vm_flags modifier functions
[14/41] mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK
[15/41] mm: replace vma->vm_flags direct modifications with modifier calls
[16/41] mm: replace vma->vm_flags indirect modification in ksm_madvise
[17/41] mm/mmap: move VMA locking before anon_vma_lock_write call
[18/41] mm/khugepaged: write-lock VMA while collapsing a huge page
[19/41] mm/mmap: write-lock VMAs before merging, splitting or expanding them
[20/41] mm/mmap: write-lock VMAs in vma_adjust
[21/41] mm/mmap: write-lock VMAs affected by VMA expansion
[22/41] mm/mremap: write-lock VMA while remapping it to a new address range
[23/41] mm: write-lock VMAs before removing them from VMA tree
[24/41] mm: conditionally write-lock VMA in free_pgtables
[25/41] mm/mmap: write-lock adjacent VMAs if they can grow into unmapped area
[26/41] kernel/fork: assert no VMA readers during its destruction
[27/41] mm/mmap: prevent pagefault handler from racing with mmu_notifier registration
[28/41] mm: introduce lock_vma_under_rcu to be used from arch-specific code
[29/41] mm: fall back to mmap_lock if vma->anon_vma is not yet set
[30/41] mm: add FAULT_FLAG_VMA_LOCK flag
[31/41] mm: prevent do_swap_page from handling page faults under VMA lock
[32/41] mm: prevent userfaults to be handled under per-vma lock
[33/41] mm: introduce per-VMA lock statistics
[34/41] x86/mm: try VMA lock-based page fault handling first
[35/41] arm64/mm: try VMA lock-based page fault handling first
[36/41] powerc/mm: try VMA lock-based page fault handling first
[37/41] mm: introduce mod_vm_flags_nolock
[38/41] mm: avoid assertion in untrack_pfn
[39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
[40/41] mm: separate vma->lock from vm_area_struct
[41/41] mm: replace rw_semaphore with atomic_t in vma_lock

Message ID

20230109205336.3665937-41-surenb@google.com (mailing list archive)

State

New

Headers

Date: Mon,  9 Jan 2023 12:53:35 -0800
In-Reply-To: <20230109205336.3665937-1-surenb@google.com>
Mime-Version: 1.0
References: <20230109205336.3665937-1-surenb@google.com>
Message-ID: <20230109205336.3665937-41-surenb@google.com>
Subject: [PATCH 40/41] mm: separate vma->lock from vm_area_struct
From: Suren Baghdasaryan <surenb@google.com>
To: akpm@linux-foundation.org
Cc: michel@lespinasse.org, jglisse@google.com, mhocko@suse.com,
 vbabka@suse.cz,
	hannes@cmpxchg.org, mgorman@techsingularity.net, dave@stgolabs.net,
	willy@infradead.org, liam.howlett@oracle.com, peterz@infradead.org,
	ldufour@linux.ibm.com, laurent.dufour@fr.ibm.com, paulmck@kernel.org,
	luto@kernel.org, songliubraving@fb.com, peterx@redhat.com, david@redhat.com,
	dhowells@redhat.com, hughd@google.com, bigeasy@linutronix.de,
	kent.overstreet@linux.dev, punit.agrawal@bytedance.com, lstoakes@gmail.com,
	peterjung1337@gmail.com, rientjes@google.com, axelrasmussen@google.com,
	joelaf@google.com, minchan@google.com, jannh@google.com, shakeelb@google.com,
	tatashin@google.com, edumazet@google.com, gthelen@google.com,
	gurua@google.com, arjunroy@google.com, soheil@google.com,
	hughlynch@google.com, leewalsh@google.com, posk@google.com,
	linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org,
	linuxppc-dev@lists.ozlabs.org, x86@kernel.org, linux-kernel@vger.kernel.org,
	kernel-team@android.com, surenb@google.com
Content-Type: text/plain; charset="UTF-8"
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Commit Message

Suren Baghdasaryan Jan. 9, 2023, 8:53 p.m. UTC

vma->lock being part of the vm_area_struct causes a performance regression
during page faults: under contention its count and owner fields are
constantly updated, and because other parts of vm_area_struct used
during page fault handling sit next to them, the cache line bounces
constantly. Fix that by moving the lock outside of the vm_area_struct.
All attempts to keep vma->lock inside vm_area_struct in a separate
cache line still produced a performance regression, especially on NUMA
machines. The smallest regression was achieved when the lock was placed
in the fourth cache line, but that bloats vm_area_struct to 256 bytes.
Considering the performance and memory impact, a separate lock looks like
the best option. It increases the memory footprint of each VMA, but that
will be addressed in the next patch.
Note that after this change vma_init() does not allocate or
initialize vma->vm_lock. A number of drivers allocate a pseudo
VMA on the stack but never use the VMA's lock, so it does not need
to be allocated for them. Future drivers that need the VMA lock
should use vm_area_alloc()/vm_area_free() to allocate it.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h       | 25 ++++++------
 include/linux/mm_types.h |  6 ++-
 kernel/fork.c            | 82 ++++++++++++++++++++++++++++------------
 3 files changed, 74 insertions(+), 39 deletions(-)
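
As a rough user-space illustration of the layout change described in the
commit message, the sketch below contrasts a lock embedded in the object with
a lock reached through a pointer. All demo_* names are hypothetical stand-ins
(struct demo_rwsem is not struct rw_semaphore), and the field list is cut down
to a few members, so the sizes are only indicative. It shows the trade-off the
message mentions: the main struct shrinks, while the per-object footprint
grows once the separate lock allocation is counted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rw_semaphore: two words that are
 * rewritten on every lock/unlock under contention. */
struct demo_rwsem {
	size_t count;
	void *owner;
};

/* Mirrors struct vma_lock from the patch: the lock in its own object. */
struct demo_vma_lock {
	struct demo_rwsem lock;
};

/* Before: the lock is embedded next to fields read on every page fault. */
struct demo_vma_embedded {
	unsigned long vm_start;
	unsigned long vm_end;
	int vm_lock_seq;
	struct demo_rwsem lock;
};

/* After: only a pointer remains; the lock is allocated separately. */
struct demo_vma_split {
	unsigned long vm_start;
	unsigned long vm_end;
	int vm_lock_seq;
	struct demo_vma_lock *vm_lock;
};

/* Bytes used per object when the lock is embedded. */
size_t demo_footprint_embedded(void)
{
	return sizeof(struct demo_vma_embedded);
}

/* Bytes used per object when the lock is a separate allocation
 * (allocator overhead is ignored in this sketch). */
size_t demo_footprint_split(void)
{
	return sizeof(struct demo_vma_split) + sizeof(struct demo_vma_lock);
}
```

The contended words now live in a different allocation from vm_start and
vm_end, at the cost of one extra pointer per object.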

Comments

Jann Horn Jan. 17, 2023, 6:33 p.m. UTC | #1

On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> vma->lock being part of the vm_area_struct causes performance regression
> during page faults because during contention its count and owner fields
> are constantly updated and having other parts of vm_area_struct used
> during page fault handling next to them causes constant cache line
> bouncing. Fix that by moving the lock outside of the vm_area_struct.
> All attempts to keep vma->lock inside vm_area_struct in a separate
> cache line still produce performance regression especially on NUMA
> machines. Smallest regression was achieved when lock is placed in the
> fourth cache line but that bloats vm_area_struct to 256 bytes.

Just checking: When you tested putting the lock in different cache
lines, did you force the slab allocator to actually store the
vm_area_struct with cacheline alignment (by setting SLAB_HWCACHE_ALIGN
on the slab or with a ____cacheline_aligned_in_smp on the struct
definition)?
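
For readers unfamiliar with why the allocation alignment matters here, the
following user-space sketch may help. It is not kernel code: alignas() and
aligned_alloc() stand in for ____cacheline_aligned_in_smp and
SLAB_HWCACHE_ALIGN, the 64-byte line size is an assumption, and all demo_*
names are hypothetical. The point is that a field offset only maps to a
particular cache line if the object itself starts on a cache line boundary.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_CACHELINE 64	/* assumed cache line size */

struct demo_lock {
	long count;
	long owner;
};

struct demo_obj {
	unsigned long start;
	unsigned long end;
	/* Push the lock to the start of the next cache line. */
	alignas(DEMO_CACHELINE) struct demo_lock lock;
};

/* Allocate the object on a cache line boundary, so the member alignment
 * above is honoured at run time. */
struct demo_obj *demo_obj_alloc(void)
{
	return aligned_alloc(DEMO_CACHELINE, sizeof(struct demo_obj));
}

/* Nonzero if the bytes at base + off_a and base + off_b share a line. */
int demo_share_line(uintptr_t base, size_t off_a, size_t off_b)
{
	return (base + off_a) / DEMO_CACHELINE ==
	       (base + off_b) / DEMO_CACHELINE;
}
```

For example, offsets 32 and 64 sit in different lines when the base address
is 0, but in the same line when the base is 32, which is the situation an
allocator without cacheline alignment can produce.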

Suren Baghdasaryan Jan. 17, 2023, 7:01 p.m. UTC | #2

On Tue, Jan 17, 2023 at 10:34 AM Jann Horn <jannh@google.com> wrote:
>
> On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > vma->lock being part of the vm_area_struct causes performance regression
> > during page faults because during contention its count and owner fields
> > are constantly updated and having other parts of vm_area_struct used
> > during page fault handling next to them causes constant cache line
> > bouncing. Fix that by moving the lock outside of the vm_area_struct.
> > All attempts to keep vma->lock inside vm_area_struct in a separate
> > cache line still produce performance regression especially on NUMA
> > machines. Smallest regression was achieved when lock is placed in the
> > fourth cache line but that bloats vm_area_struct to 256 bytes.
>
> Just checking: When you tested putting the lock in different cache
> lines, did you force the slab allocator to actually store the
> vm_area_struct with cacheline alignment (by setting SLAB_HWCACHE_ALIGN
> on the slab or with a ____cacheline_aligned_in_smp on the struct
> definition)?

Yep, I tried all these combinations and still saw noticeable regression.

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 50c7a6dd9c7a..d40bf8a5e19e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -615,11 +615,6 @@  struct vm_operations_struct {
 };
 
 #ifdef CONFIG_PER_VMA_LOCK
-static inline void vma_init_lock(struct vm_area_struct *vma)
-{
-	init_rwsem(&vma->lock);
-	vma->vm_lock_seq = -1;
-}
 
 static inline void vma_write_lock(struct vm_area_struct *vma)
 {
@@ -635,9 +630,9 @@  static inline void vma_write_lock(struct vm_area_struct *vma)
 	if (vma->vm_lock_seq == mm_lock_seq)
 		return;
 
-	down_write(&vma->lock);
+	down_write(&vma->vm_lock->lock);
 	vma->vm_lock_seq = mm_lock_seq;
-	up_write(&vma->lock);
+	up_write(&vma->vm_lock->lock);
 }
 
 /*
@@ -651,17 +646,17 @@  static inline bool vma_read_trylock(struct vm_area_struct *vma)
 	if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
 		return false;
 
-	if (unlikely(down_read_trylock(&vma->lock) == 0))
+	if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
 		return false;
 
 	/*
 	 * Overflow might produce false locked result.
 	 * False unlocked result is impossible because we modify and check
-	 * vma->vm_lock_seq under vma->lock protection and mm->mm_lock_seq
+	 * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
 	 * modification invalidates all existing locks.
 	 */
 	if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
-		up_read(&vma->lock);
+		up_read(&vma->vm_lock->lock);
 		return false;
 	}
 	return true;
@@ -669,7 +664,7 @@  static inline bool vma_read_trylock(struct vm_area_struct *vma)
 
 static inline void vma_read_unlock(struct vm_area_struct *vma)
 {
-	up_read(&vma->lock);
+	up_read(&vma->vm_lock->lock);
 }
 
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
@@ -684,7 +679,7 @@  static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 
 static inline void vma_assert_no_reader(struct vm_area_struct *vma)
 {
-	VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) &&
+	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock) &&
 		      vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
 		      vma);
 }
@@ -694,7 +689,6 @@  struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 
 #else /* CONFIG_PER_VMA_LOCK */
 
-static inline void vma_init_lock(struct vm_area_struct *vma) {}
 static inline void vma_write_lock(struct vm_area_struct *vma) {}
 static inline bool vma_read_trylock(struct vm_area_struct *vma)
 		{ return false; }
@@ -704,6 +698,10 @@  static inline void vma_assert_no_reader(struct vm_area_struct *vma) {}
 
 #endif /* CONFIG_PER_VMA_LOCK */
 
+/*
+ * WARNING: vma_init does not initialize vma->vm_lock.
+ * Use vm_area_alloc()/vm_area_free() if vma needs locking.
+ */
 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 {
 	static const struct vm_operations_struct dummy_vm_ops = {};
@@ -712,7 +710,6 @@  static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_mm = mm;
 	vma->vm_ops = &dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
-	vma_init_lock(vma);
 }
 
 /* Use when VMA is not part of the VMA tree and needs no locking */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c0e6c8e4700b..faa61b400f9b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -526,6 +526,10 @@  struct anon_vma_name {
 	char name[];
 };
 
+struct vma_lock {
+	struct rw_semaphore lock;
+};
+
 /*
  * This struct describes a virtual memory area. There is one of these
  * per VM-area/task. A VM area is any part of the process virtual memory
@@ -563,7 +567,7 @@  struct vm_area_struct {
 
 #ifdef CONFIG_PER_VMA_LOCK
 	int vm_lock_seq;
-	struct rw_semaphore lock;
+	struct vma_lock *vm_lock;
 #endif
 
 	/*
diff --git a/kernel/fork.c b/kernel/fork.c
index 97f2b751f88d..95db6a521cf1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -451,40 +451,28 @@  static struct kmem_cache *vm_area_cachep;
 /* SLAB cache for mm_struct structures (tsk->mm) */
 static struct kmem_cache *mm_cachep;
 
-struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
-{
-	struct vm_area_struct *vma;
+#ifdef CONFIG_PER_VMA_LOCK
 
-	vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
-	if (vma)
-		vma_init(vma, mm);
-	return vma;
-}
+/* SLAB cache for vm_area_struct.vm_lock */
+static struct kmem_cache *vma_lock_cachep;
 
-struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
+static bool vma_init_lock(struct vm_area_struct *vma)
 {
-	struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+	vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
+	if (!vma->vm_lock)
+		return false;
 
-	if (new) {
-		ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
-		ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
-		/*
-		 * orig->shared.rb may be modified concurrently, but the clone
-		 * will be reinitialized.
-		 */
-		*new = data_race(*orig);
-		INIT_LIST_HEAD(&new->anon_vma_chain);
-		vma_init_lock(new);
-		dup_anon_vma_name(orig, new);
-	}
-	return new;
+	init_rwsem(&vma->vm_lock->lock);
+	vma->vm_lock_seq = -1;
+
+	return true;
 }
 
-#ifdef CONFIG_PER_VMA_LOCK
 static inline void __vm_area_free(struct vm_area_struct *vma)
 {
 	/* The vma should either have no lock holders or be write-locked. */
 	vma_assert_no_reader(vma);
+	kmem_cache_free(vma_lock_cachep, vma->vm_lock);
 	kmem_cache_free(vm_area_cachep, vma);
 }
 
@@ -540,6 +528,7 @@  void vm_area_free(struct vm_area_struct *vma)
 
 #else /* CONFIG_PER_VMA_LOCK */
 
+static bool vma_init_lock(struct vm_area_struct *vma) { return true; }
 void drain_free_vmas(struct mm_struct *mm) {}
 
 void vm_area_free(struct vm_area_struct *vma)
@@ -550,6 +539,48 @@  void vm_area_free(struct vm_area_struct *vma)
 
 #endif /* CONFIG_PER_VMA_LOCK */
 
+struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
+{
+	struct vm_area_struct *vma;
+
+	vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+	if (!vma)
+		return NULL;
+
+	vma_init(vma, mm);
+	if (!vma_init_lock(vma)) {
+		kmem_cache_free(vm_area_cachep, vma);
+		return NULL;
+	}
+
+	return vma;
+}
+
+struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
+{
+	struct vm_area_struct *new;
+
+	new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+	if (!new)
+		return NULL;
+
+	ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
+	ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
+	/*
+	 * orig->shared.rb may be modified concurrently, but the clone
+	 * will be reinitialized.
+	 */
+	*new = data_race(*orig);
+	if (!vma_init_lock(new)) {
+		kmem_cache_free(vm_area_cachep, new);
+		return NULL;
+	}
+	INIT_LIST_HEAD(&new->anon_vma_chain);
+	dup_anon_vma_name(orig, new);
+
+	return new;
+}
+
 static void account_kernel_stack(struct task_struct *tsk, int account)
 {
 	if (IS_ENABLED(CONFIG_VMAP_STACK)) {
@@ -3138,6 +3169,9 @@  void __init proc_caches_init(void)
 			NULL);
 
 	vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
+#ifdef CONFIG_PER_VMA_LOCK
+	vma_lock_cachep = KMEM_CACHE(vma_lock, SLAB_PANIC|SLAB_ACCOUNT);
+#endif
 	mmap_init();
 	nsproxy_cache_init();
 }
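
The allocate-then-roll-back pattern used by vm_area_alloc() and vm_area_dup()
in the diff above can be sketched in user space as follows. This is only an
approximation under stated assumptions: calloc()/free() replace the kmem
caches, the demo_* names, the live-object counter and the failure hook are
hypothetical additions for testing, and no real locking is performed. It also
shows why the duplicate must get a fresh lock after the struct copy:
the plain assignment copies the original's lock pointer as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_lock {
	int readers;
};

struct demo_area {
	unsigned long start, end;
	int lock_seq;
	struct demo_lock *lock;
};

int demo_live_objects;		/* outstanding allocations, for leak checks */
bool demo_fail_lock_alloc;	/* test hook: make the lock allocation fail */

static void *demo_alloc(size_t size, bool fail)
{
	void *p = fail ? NULL : calloc(1, size);

	if (p)
		demo_live_objects++;
	return p;
}

static void demo_free(void *p)
{
	if (p) {
		demo_live_objects--;
		free(p);
	}
}

/* Counterpart of vma_init_lock(): allocate and initialize the lock. */
static bool demo_init_lock(struct demo_area *area)
{
	area->lock = demo_alloc(sizeof(*area->lock), demo_fail_lock_alloc);
	if (!area->lock)
		return false;
	area->lock_seq = -1;
	return true;
}

struct demo_area *demo_area_alloc(void)
{
	struct demo_area *area = demo_alloc(sizeof(*area), false);

	if (!area)
		return NULL;
	if (!demo_init_lock(area)) {
		demo_free(area);	/* roll back the first allocation */
		return NULL;
	}
	return area;
}

struct demo_area *demo_area_dup(const struct demo_area *orig)
{
	struct demo_area *new = demo_alloc(sizeof(*new), false);

	if (!new)
		return NULL;
	*new = *orig;			/* also copies orig->lock ... */
	if (!demo_init_lock(new)) {	/* ... so replace it with a fresh lock */
		demo_free(new);
		return NULL;
	}
	return new;
}

void demo_area_free(struct demo_area *area)
{
	demo_free(area->lock);
	demo_free(area);
}
```

Setting demo_fail_lock_alloc before a call exercises the error path; the
object count should be unchanged afterwards if the rollback is complete.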
