diff mbox series

[v5,4/6] mm: make vma cache SLAB_TYPESAFE_BY_RCU

Message ID 20241206225204.4008261-5-surenb@google.com (mailing list archive)
State New
Headers show
Series move per-vma lock into vm_area_struct | expand

Commit Message

Suren Baghdasaryan Dec. 6, 2024, 10:52 p.m. UTC
To enable SLAB_TYPESAFE_BY_RCU for the vma cache we need to ensure that
object reuse before the RCU grace period is over is detected inside
lock_vma_under_rcu().
lock_vma_under_rcu() enters an RCU read section, finds the vma at the
given address, locks the vma and checks if it got detached or remapped
to cover a different address range. These last checks are there
to ensure that the vma was not modified after we found it but before
locking it.
vma reuse introduces several new possibilities:
1. vma can be reused after it was found but before it is locked;
2. vma can be reused and reinitialized (including changing its vm_mm)
while being locked in vma_start_read();
3. vma can be reused and reinitialized after it was found but before
it is locked, then attached at a new address or to a new mm while
read-locked;
For case #1, the current checks will help detect cases when:
- vma was reused but not yet added into the tree (detached check);
- vma was reused at a different address range (address check).
We are missing a check for vm_mm to ensure the reused vma was not
attached to a different mm. This patch adds the missing check.
For case #2, we pass mm to vma_start_read() to prevent access to
the unstable vma->vm_mm. This might lead to vma_start_read() returning
a false locked result, but that's not critical as long as it is rare,
because it will only lead to a retry under mmap_lock.
For case #3, we ensure the order in which the vma->detached flag and
vm_start/vm_end/vm_mm are set and checked: the vma gets attached after
vm_start/vm_end/vm_mm are set, and lock_vma_under_rcu() must check
vma->detached before checking vm_start/vm_end/vm_mm. This is required
because attaching a vma happens without the vma write-lock, as opposed
to detaching it, which requires the vma write-lock. This patch adds the
memory barriers inside is_vma_detached() and vma_mark_attached() needed
to order reads and writes of vma->detached against vm_start/vm_end/vm_mm.
After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
This will facilitate vm_area_struct reuse and will minimize the number
of call_rcu() calls.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h               |  36 +++++--
 include/linux/mm_types.h         |  10 +-
 include/linux/slab.h             |   6 --
 kernel/fork.c                    | 157 +++++++++++++++++++++++++------
 mm/memory.c                      |  15 ++-
 mm/vma.c                         |   2 +-
 tools/testing/vma/vma_internal.h |   7 +-
 7 files changed, 179 insertions(+), 54 deletions(-)
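
The ordering argument for case #3 in the commit message can be illustrated with
a small stand-alone sketch. This is not kernel code: the struct, the function
names and the C11 acquire/release atomics below are illustrative stand-ins for
the smp_wmb()/smp_rmb() pairing that vma_mark_attached()/is_vma_detached() use
in the patch. It only shows why the detached flag must be published after
vm_mm/vm_start/vm_end are written, and checked before they are read.

/*
 * Stand-alone sketch of the publish/validate ordering (compile with -std=c11).
 * Writer: store the fields, then clear "detached" with release semantics.
 * Reader: load "detached" with acquire semantics before trusting the fields.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct vma_sketch {
	unsigned long vm_start;
	unsigned long vm_end;
	void *vm_mm;		/* owning mm, opaque here */
	atomic_bool detached;	/* true until the object is fully published */
};

/* Writer side: roughly what vma_mark_attached() has to guarantee. */
static void sketch_mark_attached(struct vma_sketch *vma, void *mm,
				 unsigned long start, unsigned long end)
{
	vma->vm_mm = mm;
	vma->vm_start = start;
	vma->vm_end = end;
	/* Release: all stores above become visible before detached clears. */
	atomic_store_explicit(&vma->detached, false, memory_order_release);
}

/* Reader side: roughly the order lock_vma_under_rcu() must follow. */
static bool sketch_lookup(struct vma_sketch *vma, void *mm, unsigned long addr)
{
	/* Acquire: pairs with the release store in sketch_mark_attached(). */
	if (atomic_load_explicit(&vma->detached, memory_order_acquire))
		return false;	/* reused object, not published yet */

	/* Only now are vm_mm/vm_start/vm_end known to be consistent. */
	return vma->vm_mm == mm && addr >= vma->vm_start && addr < vma->vm_end;
}

int main(void)
{
	struct vma_sketch vma = { .detached = true };
	int mm;			/* dummy mm identity */

	sketch_mark_attached(&vma, &mm, 0x1000, 0x2000);
	printf("hit: %d\n", sketch_lookup(&vma, &mm, 0x1800));
	printf("miss (wrong mm): %d\n", sketch_lookup(&vma, NULL, 0x1800));
	return 0;
}

If the reader skipped the acquire check, or performed it only after reading the
fields, a recycled object could yield stale vm_mm/vm_start/vm_end even though
the detached flag eventually reads false.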

Comments

Klara Modin Dec. 9, 2024, 5:35 p.m. UTC | #1
Hi,

On 2024-12-06 23:52, Suren Baghdasaryan wrote:
> To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> object reuse before RCU grace period is over will be detected inside
> lock_vma_under_rcu().
> lock_vma_under_rcu() enters RCU read section, finds the vma at the
> given address, locks the vma and checks if it got detached or remapped
> to cover a different address range. These last checks are there
> to ensure that the vma was not modified after we found it but before
> locking it.
> vma reuse introduces several new possibilities:
> 1. vma can be reused after it was found but before it is locked;
> 2. vma can be reused and reinitialized (including changing its vm_mm)
> while being locked in vma_start_read();
> 3. vma can be reused and reinitialized after it was found but before
> it is locked, then attached at a new address or to a new mm while
> read-locked;
> For case #1 current checks will help detecting cases when:
> - vma was reused but not yet added into the tree (detached check)
> - vma was reused at a different address range (address check);
> We are missing the check for vm_mm to ensure the reused vma was not
> attached to a different mm. This patch adds the missing check.
> For case #2, we pass mm to vma_start_read() to prevent access to
> unstable vma->vm_mm. This might lead to vma_start_read() returning
> a false locked result but that's not critical if it's rare because
> it will only lead to a retry under mmap_lock.
> For case #3, we ensure the order in which vma->detached flag and
> vm_start/vm_end/vm_mm are set and checked. vma gets attached after
> vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
> vma->detached before checking vm_start/vm_end/vm_mm. This is required
> because attaching vma happens without vma write-lock, as opposed to
> vma detaching, which requires vma write-lock. This patch adds memory
> barriers inside is_vma_detached() and vma_mark_attached() needed to
> order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
> After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
> This will facilitate vm_area_struct reuse and will minimize the number
> of call_rcu() calls.

This patch (85ad413389aec04cfaaba043caa8128b76c6e491 in next-20241209) 
seems to cause an oops on a MIPS board of mine (Cavium Octeon III) 
(abbreviated, full attached):

CPU 2 Unable to handle kernel paging request at virtual address 
0000000000000000, epc == ffffffff813a85a0, ra == ffffffff81390438
Oops[#1]:
CPU: 2 UID: 0 PID: 1 Comm: init Not tainted 
6.13.0-rc1-00162-g85ad413389ae #156
Call Trace:
unlink_anon_vmas (mm/rmap.c:408)
free_pgtables (mm/memory.c:393)
vms_clear_ptes (mm/vma.c:1143)
vms_complete_munmap_vmas (include/linux/mm.h:2737 mm/vma.c:1187)
do_vmi_align_munmap (mm/vma.c:1452)
__vm_munmap (mm/vma.c:2892)
sys_munmap (mm/mmap.c:1053)
syscall_common (arch/mips/kernel/scall64-n64.S:62)

I saw that there's already a report, but maybe another arch can be 
useful for tracking this down.

Please let me know if there's anything else you need.

Regards,
Klara Modin

Link: https://lore.kernel.org/all/202412082208.db1fb2c9-lkp@intel.com

> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>   include/linux/mm.h               |  36 +++++--
>   include/linux/mm_types.h         |  10 +-
>   include/linux/slab.h             |   6 --
>   kernel/fork.c                    | 157 +++++++++++++++++++++++++------
>   mm/memory.c                      |  15 ++-
>   mm/vma.c                         |   2 +-
>   tools/testing/vma/vma_internal.h |   7 +-
>   7 files changed, 179 insertions(+), 54 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2bf38c1e9cca..3568bcbc7c81 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -257,7 +257,7 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *);
>   struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
>   void vm_area_free(struct vm_area_struct *);
>   /* Use only if VMA has no other users */
> -void __vm_area_free(struct vm_area_struct *vma);
> +void vm_area_free_unreachable(struct vm_area_struct *vma);
>   
>   #ifndef CONFIG_MMU
>   extern struct rb_root nommu_region_tree;
> @@ -706,8 +706,10 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
>    * Try to read-lock a vma. The function is allowed to occasionally yield false
>    * locked result to avoid performance overhead, in which case we fall back to
>    * using mmap_lock. The function should never yield false unlocked result.
> + * False locked result is possible if mm_lock_seq overflows or if vma gets
> + * reused and attached to a different mm before we lock it.
>    */
> -static inline bool vma_start_read(struct vm_area_struct *vma)
> +static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
>   {
>   	/*
>   	 * Check before locking. A race might cause false locked result.
> @@ -716,7 +718,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
>   	 * we don't rely on for anything - the mm_lock_seq read against which we
>   	 * need ordering is below.
>   	 */
> -	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> +	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
>   		return false;
>   
>   	if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> @@ -733,7 +735,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
>   	 * after it has been unlocked.
>   	 * This pairs with RELEASE semantics in vma_end_write_all().
>   	 */
> -	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> +	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
>   		up_read(&vma->vm_lock.lock);
>   		return false;
>   	}
> @@ -822,7 +824,15 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
>   
>   static inline void vma_mark_attached(struct vm_area_struct *vma)
>   {
> -	vma->detached = false;
> +	/*
> +	 * This pairs with smp_rmb() inside is_vma_detached().
> +	 * vma is marked attached after all vma modifications are done and it
> +	 * got added into the vma tree. All prior vma modifications should be
> +	 * made visible before marking the vma attached.
> +	 */
> +	smp_wmb();
> +	/* This pairs with READ_ONCE() in is_vma_detached(). */
> +	WRITE_ONCE(vma->detached, false);
>   }
>   
>   static inline void vma_mark_detached(struct vm_area_struct *vma)
> @@ -834,7 +844,18 @@ static inline void vma_mark_detached(struct vm_area_struct *vma)
>   
>   static inline bool is_vma_detached(struct vm_area_struct *vma)
>   {
> -	return vma->detached;
> +	bool detached;
> +
> +	/* This pairs with WRITE_ONCE() in vma_mark_attached(). */
> +	detached = READ_ONCE(vma->detached);
> +	/*
> +	 * This pairs with smp_wmb() inside vma_mark_attached() to ensure
> +	 * vma->detached is read before vma attributes read later inside
> +	 * lock_vma_under_rcu().
> +	 */
> +	smp_rmb();
> +
> +	return detached;
>   }
>   
>   static inline void release_fault_lock(struct vm_fault *vmf)
> @@ -859,7 +880,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>   #else /* CONFIG_PER_VMA_LOCK */
>   
>   static inline void vma_lock_init(struct vm_area_struct *vma) {}
> -static inline bool vma_start_read(struct vm_area_struct *vma)
> +static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
>   		{ return false; }
>   static inline void vma_end_read(struct vm_area_struct *vma) {}
>   static inline void vma_start_write(struct vm_area_struct *vma) {}
> @@ -893,6 +914,7 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
>   
>   extern const struct vm_operations_struct vma_dummy_vm_ops;
>   
> +/* Use on VMAs not created using vm_area_alloc() */
>   static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
>   {
>   	memset(vma, 0, sizeof(*vma));
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index be3551654325..5d8779997266 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -543,6 +543,12 @@ static inline void *folio_get_private(struct folio *folio)
>   
>   typedef unsigned long vm_flags_t;
>   
> +/*
> + * freeptr_t represents a SLUB freelist pointer, which might be encoded
> + * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> + */
> +typedef struct { unsigned long v; } freeptr_t;
> +
>   /*
>    * A region containing a mapping of a non-memory backed file under NOMMU
>    * conditions.  These are held in a global tree and are pinned by the VMAs that
> @@ -657,9 +663,7 @@ struct vm_area_struct {
>   			unsigned long vm_start;
>   			unsigned long vm_end;
>   		};
> -#ifdef CONFIG_PER_VMA_LOCK
> -		struct rcu_head vm_rcu;	/* Used for deferred freeing. */
> -#endif
> +		freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */
>   	};
>   
>   	/*
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 10a971c2bde3..681b685b6c4e 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -234,12 +234,6 @@ enum _slab_flag_bits {
>   #define SLAB_NO_OBJ_EXT		__SLAB_FLAG_UNUSED
>   #endif
>   
> -/*
> - * freeptr_t represents a SLUB freelist pointer, which might be encoded
> - * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> - */
> -typedef struct { unsigned long v; } freeptr_t;
> -
>   /*
>    * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
>    *
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 71990f46aa4e..e7e76a660e4c 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -436,6 +436,98 @@ static struct kmem_cache *vm_area_cachep;
>   /* SLAB cache for mm_struct structures (tsk->mm) */
>   static struct kmem_cache *mm_cachep;
>   
> +static void vm_area_ctor(void *data)
> +{
> +	struct vm_area_struct *vma = (struct vm_area_struct *)data;
> +
> +#ifdef CONFIG_PER_VMA_LOCK
> +	/* vma is not locked, can't use vma_mark_detached() */
> +	vma->detached = true;
> +#endif
> +	INIT_LIST_HEAD(&vma->anon_vma_chain);
> +	vma_lock_init(vma);
> +}
> +
> +#ifdef CONFIG_PER_VMA_LOCK
> +
> +static void vma_clear(struct vm_area_struct *vma, struct mm_struct *mm)
> +{
> +	vma->vm_mm = mm;
> +	vma->vm_ops = &vma_dummy_vm_ops;
> +	vma->vm_start = 0;
> +	vma->vm_end = 0;
> +	vma->anon_vma = NULL;
> +	vma->vm_pgoff = 0;
> +	vma->vm_file = NULL;
> +	vma->vm_private_data = NULL;
> +	vm_flags_init(vma, 0);
> +	memset(&vma->vm_page_prot, 0, sizeof(vma->vm_page_prot));
> +	memset(&vma->shared, 0, sizeof(vma->shared));
> +	memset(&vma->vm_userfaultfd_ctx, 0, sizeof(vma->vm_userfaultfd_ctx));
> +	vma_numab_state_init(vma);
> +#ifdef CONFIG_ANON_VMA_NAME
> +	vma->anon_name = NULL;
> +#endif
> +#ifdef CONFIG_SWAP
> +	memset(&vma->swap_readahead_info, 0, sizeof(vma->swap_readahead_info));
> +#endif
> +#ifndef CONFIG_MMU
> +	vma->vm_region = NULL;
> +#endif
> +#ifdef CONFIG_NUMA
> +	vma->vm_policy = NULL;
> +#endif
> +}
> +
> +static void vma_copy(const struct vm_area_struct *src, struct vm_area_struct *dest)
> +{
> +	dest->vm_mm = src->vm_mm;
> +	dest->vm_ops = src->vm_ops;
> +	dest->vm_start = src->vm_start;
> +	dest->vm_end = src->vm_end;
> +	dest->anon_vma = src->anon_vma;
> +	dest->vm_pgoff = src->vm_pgoff;
> +	dest->vm_file = src->vm_file;
> +	dest->vm_private_data = src->vm_private_data;
> +	vm_flags_init(dest, src->vm_flags);
> +	memcpy(&dest->vm_page_prot, &src->vm_page_prot,
> +	       sizeof(dest->vm_page_prot));
> +	memcpy(&dest->shared, &src->shared, sizeof(dest->shared));
> +	memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
> +	       sizeof(dest->vm_userfaultfd_ctx));
> +#ifdef CONFIG_ANON_VMA_NAME
> +	dest->anon_name = src->anon_name;
> +#endif
> +#ifdef CONFIG_SWAP
> +	memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
> +	       sizeof(dest->swap_readahead_info));
> +#endif
> +#ifndef CONFIG_MMU
> +	dest->vm_region = src->vm_region;
> +#endif
> +#ifdef CONFIG_NUMA
> +	dest->vm_policy = src->vm_policy;
> +#endif
> +}
> +
> +#else /* CONFIG_PER_VMA_LOCK */
> +
> +static void vma_clear(struct vm_area_struct *vma, struct mm_struct *mm)
> +{
> +	vma_init(vma, mm);
> +}
> +
> +static void vma_copy(const struct vm_area_struct *src, struct vm_area_struct *dest)
> +{
> +	/*
> +	 * orig->shared.rb may be modified concurrently, but the clone
> +	 * will be reinitialized.
> +	 */
> +	data_race(memcpy(dest, src, sizeof(*dest)));
> +}
> +
> +#endif /* CONFIG_PER_VMA_LOCK */
> +
>   struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
>   {
>   	struct vm_area_struct *vma;
> @@ -444,7 +536,7 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
>   	if (!vma)
>   		return NULL;
>   
> -	vma_init(vma, mm);
> +	vma_clear(vma, mm);
>   
>   	return vma;
>   }
> @@ -458,49 +550,46 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>   
>   	ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
>   	ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
> -	/*
> -	 * orig->shared.rb may be modified concurrently, but the clone
> -	 * will be reinitialized.
> -	 */
> -	data_race(memcpy(new, orig, sizeof(*new)));
> -	vma_lock_init(new);
> -	INIT_LIST_HEAD(&new->anon_vma_chain);
> -#ifdef CONFIG_PER_VMA_LOCK
> -	/* vma is not locked, can't use vma_mark_detached() */
> -	new->detached = true;
> -#endif
> +	vma_copy(orig, new);
>   	vma_numab_state_init(new);
>   	dup_anon_vma_name(orig, new);
>   
>   	return new;
>   }
>   
> -void __vm_area_free(struct vm_area_struct *vma)
> +static void __vm_area_free(struct vm_area_struct *vma, bool unreachable)
>   {
> +#ifdef CONFIG_PER_VMA_LOCK
> +	/*
> +	 * With SLAB_TYPESAFE_BY_RCU, vma can be reused and we need
> +	 * vma->detached to be set before vma is returned into the cache.
> +	 * This way reused object won't be used by readers until it's
> +	 * initialized and reattached.
> +	 * If vma is unreachable, there can be no other users and we
> +	 * can set vma->detached directly with no risk of a race.
> +	 * If vma is reachable, then it should have been already detached
> +	 * under vma write-lock or it was never attached.
> +	 */
> +	if (unreachable)
> +		vma->detached = true;
> +	else
> +		VM_BUG_ON_VMA(!is_vma_detached(vma), vma);
> +	vma->vm_lock_seq = UINT_MAX;
> +#endif
> +	VM_BUG_ON_VMA(!list_empty(&vma->anon_vma_chain), vma);
>   	vma_numab_state_free(vma);
>   	free_anon_vma_name(vma);
>   	kmem_cache_free(vm_area_cachep, vma);
>   }
>   
> -#ifdef CONFIG_PER_VMA_LOCK
> -static void vm_area_free_rcu_cb(struct rcu_head *head)
> +void vm_area_free(struct vm_area_struct *vma)
>   {
> -	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
> -						  vm_rcu);
> -
> -	/* The vma should not be locked while being destroyed. */
> -	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
> -	__vm_area_free(vma);
> +	__vm_area_free(vma, false);
>   }
> -#endif
>   
> -void vm_area_free(struct vm_area_struct *vma)
> +void vm_area_free_unreachable(struct vm_area_struct *vma)
>   {
> -#ifdef CONFIG_PER_VMA_LOCK
> -	call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
> -#else
> -	__vm_area_free(vma);
> -#endif
> +	__vm_area_free(vma, true);
>   }
>   
>   static void account_kernel_stack(struct task_struct *tsk, int account)
> @@ -3141,6 +3230,12 @@ void __init mm_cache_init(void)
>   
>   void __init proc_caches_init(void)
>   {
> +	struct kmem_cache_args args = {
> +		.use_freeptr_offset = true,
> +		.freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr),
> +		.ctor = vm_area_ctor,
> +	};
> +
>   	sighand_cachep = kmem_cache_create("sighand_cache",
>   			sizeof(struct sighand_struct), 0,
>   			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
> @@ -3157,9 +3252,11 @@ void __init proc_caches_init(void)
>   			sizeof(struct fs_struct), 0,
>   			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
>   			NULL);
> -	vm_area_cachep = KMEM_CACHE(vm_area_struct,
> -			SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
> +	vm_area_cachep = kmem_cache_create("vm_area_struct",
> +			sizeof(struct vm_area_struct), &args,
> +			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
>   			SLAB_ACCOUNT);
> +
>   	mmap_init();
>   	nsproxy_cache_init();
>   }
> diff --git a/mm/memory.c b/mm/memory.c
> index b252f19b28c9..6f4d4d423835 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6368,10 +6368,16 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>   	if (!vma)
>   		goto inval;
>   
> -	if (!vma_start_read(vma))
> +	if (!vma_start_read(mm, vma))
>   		goto inval;
>   
> -	/* Check if the VMA got isolated after we found it */
> +	/*
> +	 * Check if the VMA got isolated after we found it.
> +	 * Note: vma we found could have been recycled and is being reattached.
> +	 * It's possible to attach a vma while it is read-locked, however a
> +	 * read-locked vma can't be detached (detaching requires write-locking).
> +	 * Therefore if this check passes, we have an attached and stable vma.
> +	 */
>   	if (is_vma_detached(vma)) {
>   		vma_end_read(vma);
>   		count_vm_vma_lock_event(VMA_LOCK_MISS);
> @@ -6385,8 +6391,9 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>   	 * fields are accessible for RCU readers.
>   	 */
>   
> -	/* Check since vm_start/vm_end might change before we lock the VMA */
> -	if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> +	/* Check if the vma we locked is the right one. */
> +	if (unlikely(vma->vm_mm != mm ||
> +		     address < vma->vm_start || address >= vma->vm_end))
>   		goto inval_end_read;
>   
>   	rcu_read_unlock();
> diff --git a/mm/vma.c b/mm/vma.c
> index cdc63728f47f..648784416833 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -414,7 +414,7 @@ void remove_vma(struct vm_area_struct *vma, bool unreachable)
>   		fput(vma->vm_file);
>   	mpol_put(vma_policy(vma));
>   	if (unreachable)
> -		__vm_area_free(vma);
> +		vm_area_free_unreachable(vma);
>   	else
>   		vm_area_free(vma);
>   }
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index 0cdc5f8c3d60..3eeb1317cc69 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -685,14 +685,15 @@ static inline void mpol_put(struct mempolicy *)
>   {
>   }
>   
> -static inline void __vm_area_free(struct vm_area_struct *vma)
> +static inline void vm_area_free(struct vm_area_struct *vma)
>   {
>   	free(vma);
>   }
>   
> -static inline void vm_area_free(struct vm_area_struct *vma)
> +static inline void vm_area_free_unreachable(struct vm_area_struct *vma)
>   {
> -	__vm_area_free(vma);
> +	vma->detached = true;
> +	vm_area_free(vma);
>   }
>   
>   static inline void lru_add_drain(void)

# bad: [7e0a79eb1674cbe8916bbc7d7f9f6bfb409dfb4c] commit 0790303ec869 leads to cpu stall without CONFIG_FANOTIFY_ACCESS_PERMISSIONS=y
git bisect start 'HEAD'
# status: waiting for good commit(s), bad commit known
# good: [fac04efc5c793dccbd07e2d59af9f90b7fc0dca4] Linux 6.13-rc2
git bisect good fac04efc5c793dccbd07e2d59af9f90b7fc0dca4
# bad: [06036a53ccaf22767bbd5caa491c52bb673604e6] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
git bisect bad 06036a53ccaf22767bbd5caa491c52bb673604e6
# bad: [ddaabea9ec4322b7d5cdd2017c4cc59b28adb8fb] Merge branch 'linux-next' of git://github.com/c-sky/csky-linux.git
git bisect bad ddaabea9ec4322b7d5cdd2017c4cc59b28adb8fb
# good: [5de9573848aa512bb06e8842faf5da6926e693bd] Merge branch 'tip/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
git bisect good 5de9573848aa512bb06e8842faf5da6926e693bd
# bad: [8bb435707c07ff7650fe8ef690700845c772d3cf] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/amlogic/linux.git
git bisect bad 8bb435707c07ff7650fe8ef690700845c772d3cf
# good: [7e274b34309bb7884cade55b7cb0b0b8722dcde2] mm: do_zap_pte_range: return any_skipped information to the caller
git bisect good 7e274b34309bb7884cade55b7cb0b0b8722dcde2
# skip: [6e165f54437931f329d09dca6c19d99af08a36e1] mm/page_isolation: fixup isolate_single_pageblock() comment regarding splitting free pages
git bisect skip 6e165f54437931f329d09dca6c19d99af08a36e1
# good: [c81cad99b0b13b0bb339a8e30aa13ebc1e5dbff3] minmax.h: update some comments
git bisect good c81cad99b0b13b0bb339a8e30aa13ebc1e5dbff3
# good: [191812969871e9ae447cf3e7426fb72b351f1336] scripts/spelling.txt: add more spellings to spelling.txt
git bisect good 191812969871e9ae447cf3e7426fb72b351f1336
# good: [f635cc53910d4ba7726f925e49ce57c5332cd6a1] ucounts: move kfree() out of critical zone protected by ucounts_lock
git bisect good f635cc53910d4ba7726f925e49ce57c5332cd6a1
# skip: [4f3ce240fd4bd7681335231b8cf21fdc37c4e848] mm/page_alloc: conditionally split > pageblock_order pages in free_one_page() and move_freepages_block_isolate()
git bisect skip 4f3ce240fd4bd7681335231b8cf21fdc37c4e848
# bad: [53b2a428e1036e481af0c7fc39cb92128344b689] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild.git
git bisect bad 53b2a428e1036e481af0c7fc39cb92128344b689
# good: [1062a24d0a17862738df0e6202d445ba0a260088] mm: introduce vma_start_read_locked{_nested} helpers
git bisect good 1062a24d0a17862738df0e6202d445ba0a260088
# bad: [c6fcd83658200c74c837c4a2e3f95f7eb487dd7d] mseal: remove can_do_mseal()
git bisect bad c6fcd83658200c74c837c4a2e3f95f7eb487dd7d
# bad: [2684a4244bfeac0c246a2610997dad78b2dc7877] mm/slab: allow freeptr_offset to be used with ctor
git bisect bad 2684a4244bfeac0c246a2610997dad78b2dc7877
# good: [98d5eefb975e46c182fc2de6a543e375481aedaf] mm: mark vma as detached until it's added into vma tree
git bisect good 98d5eefb975e46c182fc2de6a543e375481aedaf
# bad: [85ad413389aec04cfaaba043caa8128b76c6e491] mm: make vma cache SLAB_TYPESAFE_BY_RCU
git bisect bad 85ad413389aec04cfaaba043caa8128b76c6e491
# first bad commit: [85ad413389aec04cfaaba043caa8128b76c6e491] mm: make vma cache SLAB_TYPESAFE_BY_RCU
Suren Baghdasaryan Dec. 9, 2024, 8:28 p.m. UTC | #2
On Mon, Dec 9, 2024 at 9:35 AM Klara Modin <klarasmodin@gmail.com> wrote:
>
> Hi,
>
> On 2024-12-06 23:52, Suren Baghdasaryan wrote:
> > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> > object reuse before RCU grace period is over will be detected inside
> > lock_vma_under_rcu().
> > lock_vma_under_rcu() enters RCU read section, finds the vma at the
> > given address, locks the vma and checks if it got detached or remapped
> > to cover a different address range. These last checks are there
> > to ensure that the vma was not modified after we found it but before
> > locking it.
> > vma reuse introduces several new possibilities:
> > 1. vma can be reused after it was found but before it is locked;
> > 2. vma can be reused and reinitialized (including changing its vm_mm)
> > while being locked in vma_start_read();
> > 3. vma can be reused and reinitialized after it was found but before
> > it is locked, then attached at a new address or to a new mm while
> > read-locked;
> > For case #1 current checks will help detecting cases when:
> > - vma was reused but not yet added into the tree (detached check)
> > - vma was reused at a different address range (address check);
> > We are missing the check for vm_mm to ensure the reused vma was not
> > attached to a different mm. This patch adds the missing check.
> > For case #2, we pass mm to vma_start_read() to prevent access to
> > unstable vma->vm_mm. This might lead to vma_start_read() returning
> > a false locked result but that's not critical if it's rare because
> > it will only lead to a retry under mmap_lock.
> > For case #3, we ensure the order in which vma->detached flag and
> > vm_start/vm_end/vm_mm are set and checked. vma gets attached after
> > vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
> > vma->detached before checking vm_start/vm_end/vm_mm. This is required
> > because attaching vma happens without vma write-lock, as opposed to
> > vma detaching, which requires vma write-lock. This patch adds memory
> > barriers inside is_vma_detached() and vma_mark_attached() needed to
> > order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
> > After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
> > This will facilitate vm_area_struct reuse and will minimize the number
> > of call_rcu() calls.
>
> This patch (85ad413389aec04cfaaba043caa8128b76c6e491 in next-20241209)
> seems to cause an oops on a MIPS board of mine (Cavium Octeon III)
> (abbreviated, full attached):

Thanks! This is really helpful because both errors are reported on
architectures which don't set CONFIG_PER_VMA_LOCK. This is a great
clue.
I'm investigating it and will post a fix asap.
Thanks,
Suren.

>
> CPU 2 Unable to handle kernel paging request at virtual address
> 0000000000000000, epc == ffffffff813a85a0, ra == ffffffff81390438
> Oops[#1]:
> CPU: 2 UID: 0 PID: 1 Comm: init Not tainted
> 6.13.0-rc1-00162-g85ad413389ae #156
> Call Trace:
> unlink_anon_vmas (mm/rmap.c:408)
> free_pgtables (mm/memory.c:393)
> vms_clear_ptes (mm/vma.c:1143)
> vms_complete_munmap_vmas (include/linux/mm.h:2737 mm/vma.c:1187)
> do_vmi_align_munmap (mm/vma.c:1452)
> __vm_munmap (mm/vma.c:2892)
> sys_munmap (mm/mmap.c:1053)
> syscall_common (arch/mips/kernel/scall64-n64.S:62)
>
> I saw that there's already a report, but maybe another arch can be
> useful for tracking this down.
>
> Please let me know if there's anything else you need.
>
> Regards,
> Klara Modin
>
> Link: https://lore.kernel.org/all/202412082208.db1fb2c9-lkp@intel.com
>
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >   include/linux/mm.h               |  36 +++++--
> >   include/linux/mm_types.h         |  10 +-
> >   include/linux/slab.h             |   6 --
> >   kernel/fork.c                    | 157 +++++++++++++++++++++++++------
> >   mm/memory.c                      |  15 ++-
> >   mm/vma.c                         |   2 +-
> >   tools/testing/vma/vma_internal.h |   7 +-
> >   7 files changed, 179 insertions(+), 54 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 2bf38c1e9cca..3568bcbc7c81 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -257,7 +257,7 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *);
> >   struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
> >   void vm_area_free(struct vm_area_struct *);
> >   /* Use only if VMA has no other users */
> > -void __vm_area_free(struct vm_area_struct *vma);
> > +void vm_area_free_unreachable(struct vm_area_struct *vma);
> >
> >   #ifndef CONFIG_MMU
> >   extern struct rb_root nommu_region_tree;
> > @@ -706,8 +706,10 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
> >    * Try to read-lock a vma. The function is allowed to occasionally yield false
> >    * locked result to avoid performance overhead, in which case we fall back to
> >    * using mmap_lock. The function should never yield false unlocked result.
> > + * False locked result is possible if mm_lock_seq overflows or if vma gets
> > + * reused and attached to a different mm before we lock it.
> >    */
> > -static inline bool vma_start_read(struct vm_area_struct *vma)
> > +static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
> >   {
> >       /*
> >        * Check before locking. A race might cause false locked result.
> > @@ -716,7 +718,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> >        * we don't rely on for anything - the mm_lock_seq read against which we
> >        * need ordering is below.
> >        */
> > -     if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> > +     if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
> >               return false;
> >
> >       if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> > @@ -733,7 +735,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> >        * after it has been unlocked.
> >        * This pairs with RELEASE semantics in vma_end_write_all().
> >        */
> > -     if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> > +     if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
> >               up_read(&vma->vm_lock.lock);
> >               return false;
> >       }
> > @@ -822,7 +824,15 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
> >
> >   static inline void vma_mark_attached(struct vm_area_struct *vma)
> >   {
> > -     vma->detached = false;
> > +     /*
> > +      * This pairs with smp_rmb() inside is_vma_detached().
> > +      * vma is marked attached after all vma modifications are done and it
> > +      * got added into the vma tree. All prior vma modifications should be
> > +      * made visible before marking the vma attached.
> > +      */
> > +     smp_wmb();
> > +     /* This pairs with READ_ONCE() in is_vma_detached(). */
> > +     WRITE_ONCE(vma->detached, false);
> >   }
> >
> >   static inline void vma_mark_detached(struct vm_area_struct *vma)
> > @@ -834,7 +844,18 @@ static inline void vma_mark_detached(struct vm_area_struct *vma)
> >
> >   static inline bool is_vma_detached(struct vm_area_struct *vma)
> >   {
> > -     return vma->detached;
> > +     bool detached;
> > +
> > +     /* This pairs with WRITE_ONCE() in vma_mark_attached(). */
> > +     detached = READ_ONCE(vma->detached);
> > +     /*
> > +      * This pairs with smp_wmb() inside vma_mark_attached() to ensure
> > +      * vma->detached is read before vma attributes read later inside
> > +      * lock_vma_under_rcu().
> > +      */
> > +     smp_rmb();
> > +
> > +     return detached;
> >   }
> >
> >   static inline void release_fault_lock(struct vm_fault *vmf)
> > @@ -859,7 +880,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >   #else /* CONFIG_PER_VMA_LOCK */
> >
> >   static inline void vma_lock_init(struct vm_area_struct *vma) {}
> > -static inline bool vma_start_read(struct vm_area_struct *vma)
> > +static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
> >               { return false; }
> >   static inline void vma_end_read(struct vm_area_struct *vma) {}
> >   static inline void vma_start_write(struct vm_area_struct *vma) {}
> > @@ -893,6 +914,7 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
> >
> >   extern const struct vm_operations_struct vma_dummy_vm_ops;
> >
> > +/* Use on VMAs not created using vm_area_alloc() */
> >   static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> >   {
> >       memset(vma, 0, sizeof(*vma));
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index be3551654325..5d8779997266 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -543,6 +543,12 @@ static inline void *folio_get_private(struct folio *folio)
> >
> >   typedef unsigned long vm_flags_t;
> >
> > +/*
> > + * freeptr_t represents a SLUB freelist pointer, which might be encoded
> > + * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> > + */
> > +typedef struct { unsigned long v; } freeptr_t;
> > +
> >   /*
> >    * A region containing a mapping of a non-memory backed file under NOMMU
> >    * conditions.  These are held in a global tree and are pinned by the VMAs that
> > @@ -657,9 +663,7 @@ struct vm_area_struct {
> >                       unsigned long vm_start;
> >                       unsigned long vm_end;
> >               };
> > -#ifdef CONFIG_PER_VMA_LOCK
> > -             struct rcu_head vm_rcu; /* Used for deferred freeing. */
> > -#endif
> > +             freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */
> >       };
> >
> >       /*
> > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > index 10a971c2bde3..681b685b6c4e 100644
> > --- a/include/linux/slab.h
> > +++ b/include/linux/slab.h
> > @@ -234,12 +234,6 @@ enum _slab_flag_bits {
> >   #define SLAB_NO_OBJ_EXT             __SLAB_FLAG_UNUSED
> >   #endif
> >
> > -/*
> > - * freeptr_t represents a SLUB freelist pointer, which might be encoded
> > - * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> > - */
> > -typedef struct { unsigned long v; } freeptr_t;
> > -
> >   /*
> >    * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
> >    *
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 71990f46aa4e..e7e76a660e4c 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -436,6 +436,98 @@ static struct kmem_cache *vm_area_cachep;
> >   /* SLAB cache for mm_struct structures (tsk->mm) */
> >   static struct kmem_cache *mm_cachep;
> >
> > +static void vm_area_ctor(void *data)
> > +{
> > +     struct vm_area_struct *vma = (struct vm_area_struct *)data;
> > +
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +     /* vma is not locked, can't use vma_mark_detached() */
> > +     vma->detached = true;
> > +#endif
> > +     INIT_LIST_HEAD(&vma->anon_vma_chain);
> > +     vma_lock_init(vma);
> > +}
> > +
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +
> > +static void vma_clear(struct vm_area_struct *vma, struct mm_struct *mm)
> > +{
> > +     vma->vm_mm = mm;
> > +     vma->vm_ops = &vma_dummy_vm_ops;
> > +     vma->vm_start = 0;
> > +     vma->vm_end = 0;
> > +     vma->anon_vma = NULL;
> > +     vma->vm_pgoff = 0;
> > +     vma->vm_file = NULL;
> > +     vma->vm_private_data = NULL;
> > +     vm_flags_init(vma, 0);
> > +     memset(&vma->vm_page_prot, 0, sizeof(vma->vm_page_prot));
> > +     memset(&vma->shared, 0, sizeof(vma->shared));
> > +     memset(&vma->vm_userfaultfd_ctx, 0, sizeof(vma->vm_userfaultfd_ctx));
> > +     vma_numab_state_init(vma);
> > +#ifdef CONFIG_ANON_VMA_NAME
> > +     vma->anon_name = NULL;
> > +#endif
> > +#ifdef CONFIG_SWAP
> > +     memset(&vma->swap_readahead_info, 0, sizeof(vma->swap_readahead_info));
> > +#endif
> > +#ifndef CONFIG_MMU
> > +     vma->vm_region = NULL;
> > +#endif
> > +#ifdef CONFIG_NUMA
> > +     vma->vm_policy = NULL;
> > +#endif
> > +}
> > +
> > +static void vma_copy(const struct vm_area_struct *src, struct vm_area_struct *dest)
> > +{
> > +     dest->vm_mm = src->vm_mm;
> > +     dest->vm_ops = src->vm_ops;
> > +     dest->vm_start = src->vm_start;
> > +     dest->vm_end = src->vm_end;
> > +     dest->anon_vma = src->anon_vma;
> > +     dest->vm_pgoff = src->vm_pgoff;
> > +     dest->vm_file = src->vm_file;
> > +     dest->vm_private_data = src->vm_private_data;
> > +     vm_flags_init(dest, src->vm_flags);
> > +     memcpy(&dest->vm_page_prot, &src->vm_page_prot,
> > +            sizeof(dest->vm_page_prot));
> > +     memcpy(&dest->shared, &src->shared, sizeof(dest->shared));
> > +     memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
> > +            sizeof(dest->vm_userfaultfd_ctx));
> > +#ifdef CONFIG_ANON_VMA_NAME
> > +     dest->anon_name = src->anon_name;
> > +#endif
> > +#ifdef CONFIG_SWAP
> > +     memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
> > +            sizeof(dest->swap_readahead_info));
> > +#endif
> > +#ifndef CONFIG_MMU
> > +     dest->vm_region = src->vm_region;
> > +#endif
> > +#ifdef CONFIG_NUMA
> > +     dest->vm_policy = src->vm_policy;
> > +#endif
> > +}
> > +
> > +#else /* CONFIG_PER_VMA_LOCK */
> > +
> > +static void vma_clear(struct vm_area_struct *vma, struct mm_struct *mm)
> > +{
> > +     vma_init(vma, mm);
> > +}
> > +
> > +static void vma_copy(const struct vm_area_struct *src, struct vm_area_struct *dest)
> > +{
> > +     /*
> > +      * orig->shared.rb may be modified concurrently, but the clone
> > +      * will be reinitialized.
> > +      */
> > +     data_race(memcpy(dest, src, sizeof(*dest)));
> > +}
> > +
> > +#endif /* CONFIG_PER_VMA_LOCK */
> > +
> >   struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> >   {
> >       struct vm_area_struct *vma;
> > @@ -444,7 +536,7 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> >       if (!vma)
> >               return NULL;
> >
> > -     vma_init(vma, mm);
> > +     vma_clear(vma, mm);
> >
> >       return vma;
> >   }
> > @@ -458,49 +550,46 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> >
> >       ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
> >       ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
> > -     /*
> > -      * orig->shared.rb may be modified concurrently, but the clone
> > -      * will be reinitialized.
> > -      */
> > -     data_race(memcpy(new, orig, sizeof(*new)));
> > -     vma_lock_init(new);
> > -     INIT_LIST_HEAD(&new->anon_vma_chain);
> > -#ifdef CONFIG_PER_VMA_LOCK
> > -     /* vma is not locked, can't use vma_mark_detached() */
> > -     new->detached = true;
> > -#endif
> > +     vma_copy(orig, new);
> >       vma_numab_state_init(new);
> >       dup_anon_vma_name(orig, new);
> >
> >       return new;
> >   }
> >
> > -void __vm_area_free(struct vm_area_struct *vma)
> > +static void __vm_area_free(struct vm_area_struct *vma, bool unreachable)
> >   {
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +     /*
> > +      * With SLAB_TYPESAFE_BY_RCU, vma can be reused and we need
> > +      * vma->detached to be set before vma is returned into the cache.
> > +      * This way reused object won't be used by readers until it's
> > +      * initialized and reattached.
> > +      * If vma is unreachable, there can be no other users and we
> > +      * can set vma->detached directly with no risk of a race.
> > +      * If vma is reachable, then it should have been already detached
> > +      * under vma write-lock or it was never attached.
> > +      */
> > +     if (unreachable)
> > +             vma->detached = true;
> > +     else
> > +             VM_BUG_ON_VMA(!is_vma_detached(vma), vma);
> > +     vma->vm_lock_seq = UINT_MAX;
> > +#endif
> > +     VM_BUG_ON_VMA(!list_empty(&vma->anon_vma_chain), vma);
> >       vma_numab_state_free(vma);
> >       free_anon_vma_name(vma);
> >       kmem_cache_free(vm_area_cachep, vma);
> >   }
> >
> > -#ifdef CONFIG_PER_VMA_LOCK
> > -static void vm_area_free_rcu_cb(struct rcu_head *head)
> > +void vm_area_free(struct vm_area_struct *vma)
> >   {
> > -     struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
> > -                                               vm_rcu);
> > -
> > -     /* The vma should not be locked while being destroyed. */
> > -     VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
> > -     __vm_area_free(vma);
> > +     __vm_area_free(vma, false);
> >   }
> > -#endif
> >
> > -void vm_area_free(struct vm_area_struct *vma)
> > +void vm_area_free_unreachable(struct vm_area_struct *vma)
> >   {
> > -#ifdef CONFIG_PER_VMA_LOCK
> > -     call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
> > -#else
> > -     __vm_area_free(vma);
> > -#endif
> > +     __vm_area_free(vma, true);
> >   }
> >
> >   static void account_kernel_stack(struct task_struct *tsk, int account)
> > @@ -3141,6 +3230,12 @@ void __init mm_cache_init(void)
> >
> >   void __init proc_caches_init(void)
> >   {
> > +     struct kmem_cache_args args = {
> > +             .use_freeptr_offset = true,
> > +             .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr),
> > +             .ctor = vm_area_ctor,
> > +     };
> > +
> >       sighand_cachep = kmem_cache_create("sighand_cache",
> >                       sizeof(struct sighand_struct), 0,
> >                       SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
> > @@ -3157,9 +3252,11 @@ void __init proc_caches_init(void)
> >                       sizeof(struct fs_struct), 0,
> >                       SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
> >                       NULL);
> > -     vm_area_cachep = KMEM_CACHE(vm_area_struct,
> > -                     SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
> > +     vm_area_cachep = kmem_cache_create("vm_area_struct",
> > +                     sizeof(struct vm_area_struct), &args,
> > +                     SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
> >                       SLAB_ACCOUNT);
> > +
> >       mmap_init();
> >       nsproxy_cache_init();
> >   }
> > diff --git a/mm/memory.c b/mm/memory.c
> > index b252f19b28c9..6f4d4d423835 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -6368,10 +6368,16 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >       if (!vma)
> >               goto inval;
> >
> > -     if (!vma_start_read(vma))
> > +     if (!vma_start_read(mm, vma))
> >               goto inval;
> >
> > -     /* Check if the VMA got isolated after we found it */
> > +     /*
> > +      * Check if the VMA got isolated after we found it.
> > +      * Note: vma we found could have been recycled and is being reattached.
> > +      * It's possible to attach a vma while it is read-locked, however a
> > +      * read-locked vma can't be detached (detaching requires write-locking).
> > +      * Therefore if this check passes, we have an attached and stable vma.
> > +      */
> >       if (is_vma_detached(vma)) {
> >               vma_end_read(vma);
> >               count_vm_vma_lock_event(VMA_LOCK_MISS);
> > @@ -6385,8 +6391,9 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >        * fields are accessible for RCU readers.
> >        */
> >
> > -     /* Check since vm_start/vm_end might change before we lock the VMA */
> > -     if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> > +     /* Check if the vma we locked is the right one. */
> > +     if (unlikely(vma->vm_mm != mm ||
> > +                  address < vma->vm_start || address >= vma->vm_end))
> >               goto inval_end_read;
> >
> >       rcu_read_unlock();
> > diff --git a/mm/vma.c b/mm/vma.c
> > index cdc63728f47f..648784416833 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -414,7 +414,7 @@ void remove_vma(struct vm_area_struct *vma, bool unreachable)
> >               fput(vma->vm_file);
> >       mpol_put(vma_policy(vma));
> >       if (unreachable)
> > -             __vm_area_free(vma);
> > +             vm_area_free_unreachable(vma);
> >       else
> >               vm_area_free(vma);
> >   }
> > diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> > index 0cdc5f8c3d60..3eeb1317cc69 100644
> > --- a/tools/testing/vma/vma_internal.h
> > +++ b/tools/testing/vma/vma_internal.h
> > @@ -685,14 +685,15 @@ static inline void mpol_put(struct mempolicy *)
> >   {
> >   }
> >
> > -static inline void __vm_area_free(struct vm_area_struct *vma)
> > +static inline void vm_area_free(struct vm_area_struct *vma)
> >   {
> >       free(vma);
> >   }
> >
> > -static inline void vm_area_free(struct vm_area_struct *vma)
> > +static inline void vm_area_free_unreachable(struct vm_area_struct *vma)
> >   {
> > -     __vm_area_free(vma);
> > +     vma->detached = true;
> > +     vm_area_free(vma);
> >   }
> >
> >   static inline void lru_add_drain(void)
Suren Baghdasaryan Dec. 9, 2024, 10:19 p.m. UTC | #3
On Mon, Dec 9, 2024 at 12:28 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, Dec 9, 2024 at 9:35 AM Klara Modin <klarasmodin@gmail.com> wrote:
> >
> > Hi,
> >
> > On 2024-12-06 23:52, Suren Baghdasaryan wrote:
> > > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> > > object reuse before RCU grace period is over will be detected inside
> > > lock_vma_under_rcu().
> > > lock_vma_under_rcu() enters RCU read section, finds the vma at the
> > > given address, locks the vma and checks if it got detached or remapped
> > > to cover a different address range. These last checks are there
> > > to ensure that the vma was not modified after we found it but before
> > > locking it.
> > > vma reuse introduces several new possibilities:
> > > 1. vma can be reused after it was found but before it is locked;
> > > 2. vma can be reused and reinitialized (including changing its vm_mm)
> > > while being locked in vma_start_read();
> > > 3. vma can be reused and reinitialized after it was found but before
> > > it is locked, then attached at a new address or to a new mm while
> > > read-locked;
> > > For case #1 current checks will help detecting cases when:
> > > - vma was reused but not yet added into the tree (detached check)
> > > - vma was reused at a different address range (address check);
> > > We are missing the check for vm_mm to ensure the reused vma was not
> > > attached to a different mm. This patch adds the missing check.
> > > For case #2, we pass mm to vma_start_read() to prevent access to
> > > unstable vma->vm_mm. This might lead to vma_start_read() returning
> > > a false locked result but that's not critical if it's rare because
> > > it will only lead to a retry under mmap_lock.
> > > For case #3, we ensure the order in which vma->detached flag and
> > > vm_start/vm_end/vm_mm are set and checked. vma gets attached after
> > > vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
> > > vma->detached before checking vm_start/vm_end/vm_mm. This is required
> > > because attaching vma happens without vma write-lock, as opposed to
> > > vma detaching, which requires vma write-lock. This patch adds memory
> > > barriers inside is_vma_detached() and vma_mark_attached() needed to
> > > order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
> > > After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
> > > This will facilitate vm_area_struct reuse and will minimize the number
> > > of call_rcu() calls.
> >
> > This patch (85ad413389aec04cfaaba043caa8128b76c6e491 in next-20241209)
> > seems to cause an oops on a MIPS board of mine (Cavium Octeon III)
> > (abbreviated, full attached):
>
> Thanks! This is really helpful because both errors are reported on
> architectures which don't set CONFIG_PER_VMA_LOCK. This is a great
> clue.
> I'm investigating it and will post a fix asap.

I think https://lore.kernel.org/all/20241209221028.1644210-1-surenb@google.com/
should fix this.

> Thanks,
> Suren.
>
> >
> > CPU 2 Unable to handle kernel paging request at virtual address
> > 0000000000000000, epc == ffffffff813a85a0, ra == ffffffff81390438
> > Oops[#1]:
> > CPU: 2 UID: 0 PID: 1 Comm: init Not tainted
> > 6.13.0-rc1-00162-g85ad413389ae #156
> > Call Trace:
> > unlink_anon_vmas (mm/rmap.c:408)
> > free_pgtables (mm/memory.c:393)
> > vms_clear_ptes (mm/vma.c:1143)
> > vms_complete_munmap_vmas (include/linux/mm.h:2737 mm/vma.c:1187)
> > do_vmi_align_munmap (mm/vma.c:1452)
> > __vm_munmap (mm/vma.c:2892)
> > sys_munmap (mm/mmap.c:1053)
> > syscall_common (arch/mips/kernel/scall64-n64.S:62)
> >
> > I saw that there's already a report, but maybe another arch can be
> > useful for tracking this down.
> >
> > Please let me know if there's anything else you need.
> >
> > Regards,
> > Klara Modin
> >
> > Link: https://lore.kernel.org/all/202412082208.db1fb2c9-lkp@intel.com
> >
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > ---
> > >   include/linux/mm.h               |  36 +++++--
> > >   include/linux/mm_types.h         |  10 +-
> > >   include/linux/slab.h             |   6 --
> > >   kernel/fork.c                    | 157 +++++++++++++++++++++++++------
> > >   mm/memory.c                      |  15 ++-
> > >   mm/vma.c                         |   2 +-
> > >   tools/testing/vma/vma_internal.h |   7 +-
> > >   7 files changed, 179 insertions(+), 54 deletions(-)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 2bf38c1e9cca..3568bcbc7c81 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -257,7 +257,7 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *);
> > >   struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
> > >   void vm_area_free(struct vm_area_struct *);
> > >   /* Use only if VMA has no other users */
> > > -void __vm_area_free(struct vm_area_struct *vma);
> > > +void vm_area_free_unreachable(struct vm_area_struct *vma);
> > >
> > >   #ifndef CONFIG_MMU
> > >   extern struct rb_root nommu_region_tree;
> > > @@ -706,8 +706,10 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
> > >    * Try to read-lock a vma. The function is allowed to occasionally yield false
> > >    * locked result to avoid performance overhead, in which case we fall back to
> > >    * using mmap_lock. The function should never yield false unlocked result.
> > > + * False locked result is possible if mm_lock_seq overflows or if vma gets
> > > + * reused and attached to a different mm before we lock it.
> > >    */
> > > -static inline bool vma_start_read(struct vm_area_struct *vma)
> > > +static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
> > >   {
> > >       /*
> > >        * Check before locking. A race might cause false locked result.
> > > @@ -716,7 +718,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > >        * we don't rely on for anything - the mm_lock_seq read against which we
> > >        * need ordering is below.
> > >        */
> > > -     if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> > > +     if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
> > >               return false;
> > >
> > >       if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> > > @@ -733,7 +735,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > >        * after it has been unlocked.
> > >        * This pairs with RELEASE semantics in vma_end_write_all().
> > >        */
> > > -     if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> > > +     if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
> > >               up_read(&vma->vm_lock.lock);
> > >               return false;
> > >       }
> > > @@ -822,7 +824,15 @@ static inline void vma_assert_locked(struct vm_area_struct *vma)
> > >
> > >   static inline void vma_mark_attached(struct vm_area_struct *vma)
> > >   {
> > > -     vma->detached = false;
> > > +     /*
> > > +      * This pairs with smp_rmb() inside is_vma_detached().
> > > +      * vma is marked attached after all vma modifications are done and it
> > > +      * got added into the vma tree. All prior vma modifications should be
> > > +      * made visible before marking the vma attached.
> > > +      */
> > > +     smp_wmb();
> > > +     /* This pairs with READ_ONCE() in is_vma_detached(). */
> > > +     WRITE_ONCE(vma->detached, false);
> > >   }
> > >
> > >   static inline void vma_mark_detached(struct vm_area_struct *vma)
> > > @@ -834,7 +844,18 @@ static inline void vma_mark_detached(struct vm_area_struct *vma)
> > >
> > >   static inline bool is_vma_detached(struct vm_area_struct *vma)
> > >   {
> > > -     return vma->detached;
> > > +     bool detached;
> > > +
> > > +     /* This pairs with WRITE_ONCE() in vma_mark_attached(). */
> > > +     detached = READ_ONCE(vma->detached);
> > > +     /*
> > > +      * This pairs with smp_wmb() inside vma_mark_attached() to ensure
> > > +      * vma->detached is read before vma attributes read later inside
> > > +      * lock_vma_under_rcu().
> > > +      */
> > > +     smp_rmb();
> > > +
> > > +     return detached;
> > >   }
> > >
> > >   static inline void release_fault_lock(struct vm_fault *vmf)
> > > @@ -859,7 +880,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > >   #else /* CONFIG_PER_VMA_LOCK */
> > >
> > >   static inline void vma_lock_init(struct vm_area_struct *vma) {}
> > > -static inline bool vma_start_read(struct vm_area_struct *vma)
> > > +static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
> > >               { return false; }
> > >   static inline void vma_end_read(struct vm_area_struct *vma) {}
> > >   static inline void vma_start_write(struct vm_area_struct *vma) {}
> > > @@ -893,6 +914,7 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
> > >
> > >   extern const struct vm_operations_struct vma_dummy_vm_ops;
> > >
> > > +/* Use on VMAs not created using vm_area_alloc() */
> > >   static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> > >   {
> > >       memset(vma, 0, sizeof(*vma));
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index be3551654325..5d8779997266 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -543,6 +543,12 @@ static inline void *folio_get_private(struct folio *folio)
> > >
> > >   typedef unsigned long vm_flags_t;
> > >
> > > +/*
> > > + * freeptr_t represents a SLUB freelist pointer, which might be encoded
> > > + * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> > > + */
> > > +typedef struct { unsigned long v; } freeptr_t;
> > > +
> > >   /*
> > >    * A region containing a mapping of a non-memory backed file under NOMMU
> > >    * conditions.  These are held in a global tree and are pinned by the VMAs that
> > > @@ -657,9 +663,7 @@ struct vm_area_struct {
> > >                       unsigned long vm_start;
> > >                       unsigned long vm_end;
> > >               };
> > > -#ifdef CONFIG_PER_VMA_LOCK
> > > -             struct rcu_head vm_rcu; /* Used for deferred freeing. */
> > > -#endif
> > > +             freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */
> > >       };
> > >
> > >       /*
> > > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > > index 10a971c2bde3..681b685b6c4e 100644
> > > --- a/include/linux/slab.h
> > > +++ b/include/linux/slab.h
> > > @@ -234,12 +234,6 @@ enum _slab_flag_bits {
> > >   #define SLAB_NO_OBJ_EXT             __SLAB_FLAG_UNUSED
> > >   #endif
> > >
> > > -/*
> > > - * freeptr_t represents a SLUB freelist pointer, which might be encoded
> > > - * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
> > > - */
> > > -typedef struct { unsigned long v; } freeptr_t;
> > > -
> > >   /*
> > >    * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
> > >    *
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index 71990f46aa4e..e7e76a660e4c 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -436,6 +436,98 @@ static struct kmem_cache *vm_area_cachep;
> > >   /* SLAB cache for mm_struct structures (tsk->mm) */
> > >   static struct kmem_cache *mm_cachep;
> > >
> > > +static void vm_area_ctor(void *data)
> > > +{
> > > +     struct vm_area_struct *vma = (struct vm_area_struct *)data;
> > > +
> > > +#ifdef CONFIG_PER_VMA_LOCK
> > > +     /* vma is not locked, can't use vma_mark_detached() */
> > > +     vma->detached = true;
> > > +#endif
> > > +     INIT_LIST_HEAD(&vma->anon_vma_chain);
> > > +     vma_lock_init(vma);
> > > +}
> > > +
> > > +#ifdef CONFIG_PER_VMA_LOCK
> > > +
> > > +static void vma_clear(struct vm_area_struct *vma, struct mm_struct *mm)
> > > +{
> > > +     vma->vm_mm = mm;
> > > +     vma->vm_ops = &vma_dummy_vm_ops;
> > > +     vma->vm_start = 0;
> > > +     vma->vm_end = 0;
> > > +     vma->anon_vma = NULL;
> > > +     vma->vm_pgoff = 0;
> > > +     vma->vm_file = NULL;
> > > +     vma->vm_private_data = NULL;
> > > +     vm_flags_init(vma, 0);
> > > +     memset(&vma->vm_page_prot, 0, sizeof(vma->vm_page_prot));
> > > +     memset(&vma->shared, 0, sizeof(vma->shared));
> > > +     memset(&vma->vm_userfaultfd_ctx, 0, sizeof(vma->vm_userfaultfd_ctx));
> > > +     vma_numab_state_init(vma);
> > > +#ifdef CONFIG_ANON_VMA_NAME
> > > +     vma->anon_name = NULL;
> > > +#endif
> > > +#ifdef CONFIG_SWAP
> > > +     memset(&vma->swap_readahead_info, 0, sizeof(vma->swap_readahead_info));
> > > +#endif
> > > +#ifndef CONFIG_MMU
> > > +     vma->vm_region = NULL;
> > > +#endif
> > > +#ifdef CONFIG_NUMA
> > > +     vma->vm_policy = NULL;
> > > +#endif
> > > +}
> > > +
> > > +static void vma_copy(const struct vm_area_struct *src, struct vm_area_struct *dest)
> > > +{
> > > +     dest->vm_mm = src->vm_mm;
> > > +     dest->vm_ops = src->vm_ops;
> > > +     dest->vm_start = src->vm_start;
> > > +     dest->vm_end = src->vm_end;
> > > +     dest->anon_vma = src->anon_vma;
> > > +     dest->vm_pgoff = src->vm_pgoff;
> > > +     dest->vm_file = src->vm_file;
> > > +     dest->vm_private_data = src->vm_private_data;
> > > +     vm_flags_init(dest, src->vm_flags);
> > > +     memcpy(&dest->vm_page_prot, &src->vm_page_prot,
> > > +            sizeof(dest->vm_page_prot));
> > > +     memcpy(&dest->shared, &src->shared, sizeof(dest->shared));
> > > +     memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
> > > +            sizeof(dest->vm_userfaultfd_ctx));
> > > +#ifdef CONFIG_ANON_VMA_NAME
> > > +     dest->anon_name = src->anon_name;
> > > +#endif
> > > +#ifdef CONFIG_SWAP
> > > +     memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
> > > +            sizeof(dest->swap_readahead_info));
> > > +#endif
> > > +#ifndef CONFIG_MMU
> > > +     dest->vm_region = src->vm_region;
> > > +#endif
> > > +#ifdef CONFIG_NUMA
> > > +     dest->vm_policy = src->vm_policy;
> > > +#endif
> > > +}
> > > +
> > > +#else /* CONFIG_PER_VMA_LOCK */
> > > +
> > > +static void vma_clear(struct vm_area_struct *vma, struct mm_struct *mm)
> > > +{
> > > +     vma_init(vma, mm);
> > > +}
> > > +
> > > +static void vma_copy(const struct vm_area_struct *src, struct vm_area_struct *dest)
> > > +{
> > > +     /*
> > > +      * orig->shared.rb may be modified concurrently, but the clone
> > > +      * will be reinitialized.
> > > +      */
> > > +     data_race(memcpy(dest, src, sizeof(*dest)));
> > > +}
> > > +
> > > +#endif /* CONFIG_PER_VMA_LOCK */
> > > +
> > >   struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> > >   {
> > >       struct vm_area_struct *vma;
> > > @@ -444,7 +536,7 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
> > >       if (!vma)
> > >               return NULL;
> > >
> > > -     vma_init(vma, mm);
> > > +     vma_clear(vma, mm);
> > >
> > >       return vma;
> > >   }
> > > @@ -458,49 +550,46 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> > >
> > >       ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
> > >       ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
> > > -     /*
> > > -      * orig->shared.rb may be modified concurrently, but the clone
> > > -      * will be reinitialized.
> > > -      */
> > > -     data_race(memcpy(new, orig, sizeof(*new)));
> > > -     vma_lock_init(new);
> > > -     INIT_LIST_HEAD(&new->anon_vma_chain);
> > > -#ifdef CONFIG_PER_VMA_LOCK
> > > -     /* vma is not locked, can't use vma_mark_detached() */
> > > -     new->detached = true;
> > > -#endif
> > > +     vma_copy(orig, new);
> > >       vma_numab_state_init(new);
> > >       dup_anon_vma_name(orig, new);
> > >
> > >       return new;
> > >   }
> > >
> > > -void __vm_area_free(struct vm_area_struct *vma)
> > > +static void __vm_area_free(struct vm_area_struct *vma, bool unreachable)
> > >   {
> > > +#ifdef CONFIG_PER_VMA_LOCK
> > > +     /*
> > > +      * With SLAB_TYPESAFE_BY_RCU, vma can be reused and we need
> > > +      * vma->detached to be set before vma is returned into the cache.
> > > +      * This way reused object won't be used by readers until it's
> > > +      * initialized and reattached.
> > > +      * If vma is unreachable, there can be no other users and we
> > > +      * can set vma->detached directly with no risk of a race.
> > > +      * If vma is reachable, then it should have been already detached
> > > +      * under vma write-lock or it was never attached.
> > > +      */
> > > +     if (unreachable)
> > > +             vma->detached = true;
> > > +     else
> > > +             VM_BUG_ON_VMA(!is_vma_detached(vma), vma);
> > > +     vma->vm_lock_seq = UINT_MAX;
> > > +#endif
> > > +     VM_BUG_ON_VMA(!list_empty(&vma->anon_vma_chain), vma);
> > >       vma_numab_state_free(vma);
> > >       free_anon_vma_name(vma);
> > >       kmem_cache_free(vm_area_cachep, vma);
> > >   }
> > >
> > > -#ifdef CONFIG_PER_VMA_LOCK
> > > -static void vm_area_free_rcu_cb(struct rcu_head *head)
> > > +void vm_area_free(struct vm_area_struct *vma)
> > >   {
> > > -     struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
> > > -                                               vm_rcu);
> > > -
> > > -     /* The vma should not be locked while being destroyed. */
> > > -     VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
> > > -     __vm_area_free(vma);
> > > +     __vm_area_free(vma, false);
> > >   }
> > > -#endif
> > >
> > > -void vm_area_free(struct vm_area_struct *vma)
> > > +void vm_area_free_unreachable(struct vm_area_struct *vma)
> > >   {
> > > -#ifdef CONFIG_PER_VMA_LOCK
> > > -     call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
> > > -#else
> > > -     __vm_area_free(vma);
> > > -#endif
> > > +     __vm_area_free(vma, true);
> > >   }
> > >
> > >   static void account_kernel_stack(struct task_struct *tsk, int account)
> > > @@ -3141,6 +3230,12 @@ void __init mm_cache_init(void)
> > >
> > >   void __init proc_caches_init(void)
> > >   {
> > > +     struct kmem_cache_args args = {
> > > +             .use_freeptr_offset = true,
> > > +             .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr),
> > > +             .ctor = vm_area_ctor,
> > > +     };
> > > +
> > >       sighand_cachep = kmem_cache_create("sighand_cache",
> > >                       sizeof(struct sighand_struct), 0,
> > >                       SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
> > > @@ -3157,9 +3252,11 @@ void __init proc_caches_init(void)
> > >                       sizeof(struct fs_struct), 0,
> > >                       SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
> > >                       NULL);
> > > -     vm_area_cachep = KMEM_CACHE(vm_area_struct,
> > > -                     SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
> > > +     vm_area_cachep = kmem_cache_create("vm_area_struct",
> > > +                     sizeof(struct vm_area_struct), &args,
> > > +                     SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
> > >                       SLAB_ACCOUNT);
> > > +
> > >       mmap_init();
> > >       nsproxy_cache_init();
> > >   }
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index b252f19b28c9..6f4d4d423835 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -6368,10 +6368,16 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > >       if (!vma)
> > >               goto inval;
> > >
> > > -     if (!vma_start_read(vma))
> > > +     if (!vma_start_read(mm, vma))
> > >               goto inval;
> > >
> > > -     /* Check if the VMA got isolated after we found it */
> > > +     /*
> > > +      * Check if the VMA got isolated after we found it.
> > > +      * Note: vma we found could have been recycled and is being reattached.
> > > +      * It's possible to attach a vma while it is read-locked, however a
> > > +      * read-locked vma can't be detached (detaching requires write-locking).
> > > +      * Therefore if this check passes, we have an attached and stable vma.
> > > +      */
> > >       if (is_vma_detached(vma)) {
> > >               vma_end_read(vma);
> > >               count_vm_vma_lock_event(VMA_LOCK_MISS);
> > > @@ -6385,8 +6391,9 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > >        * fields are accessible for RCU readers.
> > >        */
> > >
> > > -     /* Check since vm_start/vm_end might change before we lock the VMA */
> > > -     if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> > > +     /* Check if the vma we locked is the right one. */
> > > +     if (unlikely(vma->vm_mm != mm ||
> > > +                  address < vma->vm_start || address >= vma->vm_end))
> > >               goto inval_end_read;
> > >
> > >       rcu_read_unlock();
> > > diff --git a/mm/vma.c b/mm/vma.c
> > > index cdc63728f47f..648784416833 100644
> > > --- a/mm/vma.c
> > > +++ b/mm/vma.c
> > > @@ -414,7 +414,7 @@ void remove_vma(struct vm_area_struct *vma, bool unreachable)
> > >               fput(vma->vm_file);
> > >       mpol_put(vma_policy(vma));
> > >       if (unreachable)
> > > -             __vm_area_free(vma);
> > > +             vm_area_free_unreachable(vma);
> > >       else
> > >               vm_area_free(vma);
> > >   }
> > > diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> > > index 0cdc5f8c3d60..3eeb1317cc69 100644
> > > --- a/tools/testing/vma/vma_internal.h
> > > +++ b/tools/testing/vma/vma_internal.h
> > > @@ -685,14 +685,15 @@ static inline void mpol_put(struct mempolicy *)
> > >   {
> > >   }
> > >
> > > -static inline void __vm_area_free(struct vm_area_struct *vma)
> > > +static inline void vm_area_free(struct vm_area_struct *vma)
> > >   {
> > >       free(vma);
> > >   }
> > >
> > > -static inline void vm_area_free(struct vm_area_struct *vma)
> > > +static inline void vm_area_free_unreachable(struct vm_area_struct *vma)
> > >   {
> > > -     __vm_area_free(vma);
> > > +     vma->detached = true;
> > > +     vm_area_free(vma);
> > >   }
> > >
> > >   static inline void lru_add_drain(void)
Vlastimil Babka Dec. 10, 2024, 12:06 p.m. UTC | #4
On 12/6/24 23:52, Suren Baghdasaryan wrote:
> To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> object reuse before RCU grace period is over will be detected inside
> lock_vma_under_rcu().
> lock_vma_under_rcu() enters RCU read section, finds the vma at the
> given address, locks the vma and checks if it got detached or remapped
> to cover a different address range. These last checks are there
> to ensure that the vma was not modified after we found it but before
> locking it.
> vma reuse introduces several new possibilities:
> 1. vma can be reused after it was found but before it is locked;
> 2. vma can be reused and reinitialized (including changing its vm_mm)
> while being locked in vma_start_read();
> 3. vma can be reused and reinitialized after it was found but before
> it is locked, then attached at a new address or to a new mm while
> read-locked;
> For case #1 current checks will help detecting cases when:
> - vma was reused but not yet added into the tree (detached check)
> - vma was reused at a different address range (address check);
> We are missing the check for vm_mm to ensure the reused vma was not
> attached to a different mm. This patch adds the missing check.
> For case #2, we pass mm to vma_start_read() to prevent access to
> unstable vma->vm_mm. This might lead to vma_start_read() returning
> a false locked result but that's not critical if it's rare because
> it will only lead to a retry under mmap_lock.
> For case #3, we ensure the order in which vma->detached flag and
> vm_start/vm_end/vm_mm are set and checked. vma gets attached after
> vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
> vma->detached before checking vm_start/vm_end/vm_mm. This is required
> because attaching vma happens without vma write-lock, as opposed to
> vma detaching, which requires vma write-lock. This patch adds memory
> barriers inside is_vma_detached() and vma_mark_attached() needed to
> order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
> After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
> This will facilitate vm_area_struct reuse and will minimize the number
> of call_rcu() calls.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  include/linux/mm.h               |  36 +++++--
>  include/linux/mm_types.h         |  10 +-
>  include/linux/slab.h             |   6 --
>  kernel/fork.c                    | 157 +++++++++++++++++++++++++------
>  mm/memory.c                      |  15 ++-
>  mm/vma.c                         |   2 +-
>  tools/testing/vma/vma_internal.h |   7 +-
>  7 files changed, 179 insertions(+), 54 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2bf38c1e9cca..3568bcbc7c81 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -257,7 +257,7 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *);
>  struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
>  void vm_area_free(struct vm_area_struct *);
>  /* Use only if VMA has no other users */
> -void __vm_area_free(struct vm_area_struct *vma);
> +void vm_area_free_unreachable(struct vm_area_struct *vma);
>  
>  #ifndef CONFIG_MMU
>  extern struct rb_root nommu_region_tree;
> @@ -706,8 +706,10 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
>   * Try to read-lock a vma. The function is allowed to occasionally yield false
>   * locked result to avoid performance overhead, in which case we fall back to
>   * using mmap_lock. The function should never yield false unlocked result.
> + * False locked result is possible if mm_lock_seq overflows or if vma gets
> + * reused and attached to a different mm before we lock it.
>   */
> -static inline bool vma_start_read(struct vm_area_struct *vma)
> +static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
>  {
>  	/*
>  	 * Check before locking. A race might cause false locked result.
> @@ -716,7 +718,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
>  	 * we don't rely on for anything - the mm_lock_seq read against which we
>  	 * need ordering is below.
>  	 */
> -	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> +	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
>  		return false;
>  
>  	if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> @@ -733,7 +735,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
>  	 * after it has been unlocked.
>  	 * This pairs with RELEASE semantics in vma_end_write_all().
>  	 */
> -	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> +	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
>  		up_read(&vma->vm_lock.lock);
>  		return false;
>  	}

This could have been perhaps another preparatory patch to make this one smaller?

>  
> +static void vm_area_ctor(void *data)
> +{
> +	struct vm_area_struct *vma = (struct vm_area_struct *)data;
> +
> +#ifdef CONFIG_PER_VMA_LOCK
> +	/* vma is not locked, can't use vma_mark_detached() */
> +	vma->detached = true;
> +#endif
> +	INIT_LIST_HEAD(&vma->anon_vma_chain);
> +	vma_lock_init(vma);
> +}
> +
> +#ifdef CONFIG_PER_VMA_LOCK
> +
> +static void vma_clear(struct vm_area_struct *vma, struct mm_struct *mm)
> +{
> +	vma->vm_mm = mm;
> +	vma->vm_ops = &vma_dummy_vm_ops;
> +	vma->vm_start = 0;
> +	vma->vm_end = 0;
> +	vma->anon_vma = NULL;
> +	vma->vm_pgoff = 0;
> +	vma->vm_file = NULL;
> +	vma->vm_private_data = NULL;
> +	vm_flags_init(vma, 0);
> +	memset(&vma->vm_page_prot, 0, sizeof(vma->vm_page_prot));
> +	memset(&vma->shared, 0, sizeof(vma->shared));
> +	memset(&vma->vm_userfaultfd_ctx, 0, sizeof(vma->vm_userfaultfd_ctx));
> +	vma_numab_state_init(vma);
> +#ifdef CONFIG_ANON_VMA_NAME
> +	vma->anon_name = NULL;
> +#endif
> +#ifdef CONFIG_SWAP
> +	memset(&vma->swap_readahead_info, 0, sizeof(vma->swap_readahead_info));
> +#endif
> +#ifndef CONFIG_MMU
> +	vma->vm_region = NULL;
> +#endif
> +#ifdef CONFIG_NUMA
> +	vma->vm_policy = NULL;
> +#endif

This isn't the ideal pattern I think, now that we have a ctor. Ideally the
ctor would do all this (except setting the vm_mm), and then we need to make
sure it's also done when freeing the vma, to make sure the freed object is
in the same state as a new object after the constructor.

On freeing, things like numab_state and anon_name could be NULL'd (by the
respective destructors) only when they are non-NULL and thus freeing the
objects pointed to. vm_policy and vm_file could perhaps be handled same way
after some refactoring (see remove_vma()), vma_dummy_vm_ops are possibly
already reset by vma_close(), etc.
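
Roughly the shape I mean (untested sketch; the helper names are the ones
from this patch, and having the free side reset pointers back to NULL is
the suggested change, not what those helpers do today):

static void vm_area_ctor(void *data)
{
	struct vm_area_struct *vma = data;

	/* one-time init into the "empty vma" state */
	memset(vma, 0, sizeof(*vma));
	vma_lock_init(vma);
	INIT_LIST_HEAD(&vma->anon_vma_chain);
	vma->vm_ops = &vma_dummy_vm_ops;
#ifdef CONFIG_PER_VMA_LOCK
	/* vma is not locked, can't use vma_mark_detached() */
	vma->detached = true;
#endif
}

vm_area_alloc() would then only have to set vma->vm_mm, and the freeing
path would have to return the object to exactly this state (anon_name,
numab_state, vm_policy, vm_file back to NULL, vm_ops back to
vma_dummy_vm_ops, ...) before kmem_cache_free(), so vma_clear() isn't
needed on allocation.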

> +}
> +
> +static void vma_copy(const struct vm_area_struct *src, struct vm_area_struct *dest)
> +{
> +	dest->vm_mm = src->vm_mm;
> +	dest->vm_ops = src->vm_ops;
> +	dest->vm_start = src->vm_start;
> +	dest->vm_end = src->vm_end;
> +	dest->anon_vma = src->anon_vma;
> +	dest->vm_pgoff = src->vm_pgoff;
> +	dest->vm_file = src->vm_file;
> +	dest->vm_private_data = src->vm_private_data;
> +	vm_flags_init(dest, src->vm_flags);
> +	memcpy(&dest->vm_page_prot, &src->vm_page_prot,
> +	       sizeof(dest->vm_page_prot));
> +	memcpy(&dest->shared, &src->shared, sizeof(dest->shared));
> +	memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
> +	       sizeof(dest->vm_userfaultfd_ctx));
> +#ifdef CONFIG_ANON_VMA_NAME
> +	dest->anon_name = src->anon_name;
> +#endif
> +#ifdef CONFIG_SWAP
> +	memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
> +	       sizeof(dest->swap_readahead_info));
> +#endif
> +#ifndef CONFIG_MMU
> +	dest->vm_region = src->vm_region;
> +#endif
> +#ifdef CONFIG_NUMA
> +	dest->vm_policy = src->vm_policy;
> +#endif
> +}
> +
> +#else /* CONFIG_PER_VMA_LOCK */
Vlastimil Babka Dec. 10, 2024, 2:21 p.m. UTC | #5
On 12/6/24 23:52, Suren Baghdasaryan wrote:
> To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> object reuse before RCU grace period is over will be detected inside
> lock_vma_under_rcu().
> lock_vma_under_rcu() enters RCU read section, finds the vma at the
> given address, locks the vma and checks if it got detached or remapped
> to cover a different address range. These last checks are there
> to ensure that the vma was not modified after we found it but before
> locking it.
> vma reuse introduces several new possibilities:
> 1. vma can be reused after it was found but before it is locked;
> 2. vma can be reused and reinitialized (including changing its vm_mm)
> while being locked in vma_start_read();
> 3. vma can be reused and reinitialized after it was found but before
> it is locked, then attached at a new address or to a new mm while
> read-locked;
> For case #1 current checks will help detecting cases when:
> - vma was reused but not yet added into the tree (detached check)
> - vma was reused at a different address range (address check);
> We are missing the check for vm_mm to ensure the reused vma was not
> attached to a different mm. This patch adds the missing check.
> For case #2, we pass mm to vma_start_read() to prevent access to
> unstable vma->vm_mm. This might lead to vma_start_read() returning
> a false locked result but that's not critical if it's rare because
> it will only lead to a retry under mmap_lock.
> For case #3, we ensure the order in which vma->detached flag and
> vm_start/vm_end/vm_mm are set and checked. vma gets attached after
> vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
> vma->detached before checking vm_start/vm_end/vm_mm. This is required
> because attaching vma happens without vma write-lock, as opposed to
> vma detaching, which requires vma write-lock. This patch adds memory
> barriers inside is_vma_detached() and vma_mark_attached() needed to
> order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
> After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
> This will facilitate vm_area_struct reuse and will minimize the number
> of call_rcu() calls.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>

I'm wondering about the vma freeing path. Consider vma_complete():

vma_mark_detached(vp->remove);
  vma->detached = true; - plain write
vm_area_free(vp->remove);
  vma->vm_lock_seq = UINT_MAX; - plain write
  kmem_cache_free(vm_area_cachep)
...
potential reallocation

against:

lock_vma_under_rcu()
- mas_walk finds a stale vma due to race
vma_start_read()
  if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
  - can be false, the vma was not being locked on the freeing side?
  down_read_trylock(&vma->vm_lock.lock) - succeeds, wasn't locked
    this is acquire, but was there any release?
  is_vma_detached() - false negative as the write above didn't propagate
    here yet; a read barrier but where is the write barrier?
  checks for vma->vm_mm, vm_start, vm_end - nobody reset them yet so false
    positive, or they got reset on reallocation but writes didn't propagate

Am I missing something that would prevent lock_vma_under_rcu() falsely
succeeding here?
Suren Baghdasaryan Dec. 10, 2024, 4:20 p.m. UTC | #6
On Tue, Dec 10, 2024 at 6:21 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/6/24 23:52, Suren Baghdasaryan wrote:
> > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> > object reuse before RCU grace period is over will be detected inside
> > lock_vma_under_rcu().
> > lock_vma_under_rcu() enters RCU read section, finds the vma at the
> > given address, locks the vma and checks if it got detached or remapped
> > to cover a different address range. These last checks are there
> > to ensure that the vma was not modified after we found it but before
> > locking it.
> > vma reuse introduces several new possibilities:
> > 1. vma can be reused after it was found but before it is locked;
> > 2. vma can be reused and reinitialized (including changing its vm_mm)
> > while being locked in vma_start_read();
> > 3. vma can be reused and reinitialized after it was found but before
> > it is locked, then attached at a new address or to a new mm while
> > read-locked;
> > For case #1 current checks will help detecting cases when:
> > - vma was reused but not yet added into the tree (detached check)
> > - vma was reused at a different address range (address check);
> > We are missing the check for vm_mm to ensure the reused vma was not
> > attached to a different mm. This patch adds the missing check.
> > For case #2, we pass mm to vma_start_read() to prevent access to
> > unstable vma->vm_mm. This might lead to vma_start_read() returning
> > a false locked result but that's not critical if it's rare because
> > it will only lead to a retry under mmap_lock.
> > For case #3, we ensure the order in which vma->detached flag and
> > vm_start/vm_end/vm_mm are set and checked. vma gets attached after
> > vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
> > vma->detached before checking vm_start/vm_end/vm_mm. This is required
> > because attaching vma happens without vma write-lock, as opposed to
> > vma detaching, which requires vma write-lock. This patch adds memory
> > barriers inside is_vma_detached() and vma_mark_attached() needed to
> > order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
> > After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
> > This will facilitate vm_area_struct reuse and will minimize the number
> > of call_rcu() calls.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>
> I'm wondering about the vma freeing path. Consider vma_complete():
>
> vma_mark_detached(vp->remove);
>   vma->detached = true; - plain write
> vm_area_free(vp->remove);
>   vma->vm_lock_seq = UINT_MAX; - plain write
>   kmem_cache_free(vm_area_cachep)
> ...
> potential reallocation
>
> against:
>
> lock_vma_under_rcu()
> - mas_walk finds a stale vma due to race
> vma_start_read()
>   if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
>   - can be false, the vma was not being locked on the freeing side?
>   down_read_trylock(&vma->vm_lock.lock) - suceeds, wasn't locked
>     this is acquire, but was there any release?

Yes, there was a release. I think what you missed is that
vma_mark_detached() that is called from vma_complete() requires VMA to
be write-locked (see vma_assert_write_locked() in
vma_mark_detached()). The rule is that a VMA can be attached without
write-locking but only a write-locked VMA can be detached. So, after
vma_mark_detached() and before down_read_trylock(&vma->vm_lock.lock)
in vma_start_read() the VMA write-lock should have been released by
mmap_write_unlock() and therefore vma->detached=true should be
visible to the reader when it executed lock_vma_under_rcu().
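
For reference, the asymmetry in this patch looks roughly like this
(simplified sketch, not the literal code):

static inline void vma_mark_attached(struct vm_area_struct *vma)
{
	/* allowed without the vma write-lock; the barrier pairs with
	 * smp_rmb() in is_vma_detached() */
	smp_wmb();
	WRITE_ONCE(vma->detached, false);
}

static inline void vma_mark_detached(struct vm_area_struct *vma)
{
	/* detaching requires the vma write-lock */
	vma_assert_write_locked(vma);
	vma->detached = true;
}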

>   is_vma_detached() - false negative as the write above didn't propagate
>     here yet; a read barrier but where is the write barrier?
>   checks for vma->vm_mm, vm_start, vm_end - nobody reset them yet so false
>     positive, or they got reset on reallocation but writes didn't propagate
>
> Am I missing something that would prevent lock_vma_under_rcu() falsely
> succeeding here?
>
Suren Baghdasaryan Dec. 10, 2024, 4:23 p.m. UTC | #7
On Tue, Dec 10, 2024 at 4:06 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/6/24 23:52, Suren Baghdasaryan wrote:
> > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> > object reuse before RCU grace period is over will be detected inside
> > lock_vma_under_rcu().
> > lock_vma_under_rcu() enters RCU read section, finds the vma at the
> > given address, locks the vma and checks if it got detached or remapped
> > to cover a different address range. These last checks are there
> > to ensure that the vma was not modified after we found it but before
> > locking it.
> > vma reuse introduces several new possibilities:
> > 1. vma can be reused after it was found but before it is locked;
> > 2. vma can be reused and reinitialized (including changing its vm_mm)
> > while being locked in vma_start_read();
> > 3. vma can be reused and reinitialized after it was found but before
> > it is locked, then attached at a new address or to a new mm while
> > read-locked;
> > For case #1 current checks will help detecting cases when:
> > - vma was reused but not yet added into the tree (detached check)
> > - vma was reused at a different address range (address check);
> > We are missing the check for vm_mm to ensure the reused vma was not
> > attached to a different mm. This patch adds the missing check.
> > For case #2, we pass mm to vma_start_read() to prevent access to
> > unstable vma->vm_mm. This might lead to vma_start_read() returning
> > a false locked result but that's not critical if it's rare because
> > it will only lead to a retry under mmap_lock.
> > For case #3, we ensure the order in which vma->detached flag and
> > vm_start/vm_end/vm_mm are set and checked. vma gets attached after
> > vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
> > vma->detached before checking vm_start/vm_end/vm_mm. This is required
> > because attaching vma happens without vma write-lock, as opposed to
> > vma detaching, which requires vma write-lock. This patch adds memory
> > barriers inside is_vma_detached() and vma_mark_attached() needed to
> > order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
> > After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
> > This will facilitate vm_area_struct reuse and will minimize the number
> > of call_rcu() calls.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  include/linux/mm.h               |  36 +++++--
> >  include/linux/mm_types.h         |  10 +-
> >  include/linux/slab.h             |   6 --
> >  kernel/fork.c                    | 157 +++++++++++++++++++++++++------
> >  mm/memory.c                      |  15 ++-
> >  mm/vma.c                         |   2 +-
> >  tools/testing/vma/vma_internal.h |   7 +-
> >  7 files changed, 179 insertions(+), 54 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 2bf38c1e9cca..3568bcbc7c81 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -257,7 +257,7 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *);
> >  struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
> >  void vm_area_free(struct vm_area_struct *);
> >  /* Use only if VMA has no other users */
> > -void __vm_area_free(struct vm_area_struct *vma);
> > +void vm_area_free_unreachable(struct vm_area_struct *vma);
> >
> >  #ifndef CONFIG_MMU
> >  extern struct rb_root nommu_region_tree;
> > @@ -706,8 +706,10 @@ static inline void vma_lock_init(struct vm_area_struct *vma)
> >   * Try to read-lock a vma. The function is allowed to occasionally yield false
> >   * locked result to avoid performance overhead, in which case we fall back to
> >   * using mmap_lock. The function should never yield false unlocked result.
> > + * False locked result is possible if mm_lock_seq overflows or if vma gets
> > + * reused and attached to a different mm before we lock it.
> >   */
> > -static inline bool vma_start_read(struct vm_area_struct *vma)
> > +static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
> >  {
> >       /*
> >        * Check before locking. A race might cause false locked result.
> > @@ -716,7 +718,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> >        * we don't rely on for anything - the mm_lock_seq read against which we
> >        * need ordering is below.
> >        */
> > -     if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
> > +     if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
> >               return false;
> >
> >       if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
> > @@ -733,7 +735,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> >        * after it has been unlocked.
> >        * This pairs with RELEASE semantics in vma_end_write_all().
> >        */
> > -     if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
> > +     if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
> >               up_read(&vma->vm_lock.lock);
> >               return false;
> >       }
>
> This could have been perhaps another preparatory patch to make this one smaller?
>
> >
> > +static void vm_area_ctor(void *data)
> > +{
> > +     struct vm_area_struct *vma = (struct vm_area_struct *)data;
> > +
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +     /* vma is not locked, can't use vma_mark_detached() */
> > +     vma->detached = true;
> > +#endif
> > +     INIT_LIST_HEAD(&vma->anon_vma_chain);
> > +     vma_lock_init(vma);
> > +}
> > +
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +
> > +static void vma_clear(struct vm_area_struct *vma, struct mm_struct *mm)
> > +{
> > +     vma->vm_mm = mm;
> > +     vma->vm_ops = &vma_dummy_vm_ops;
> > +     vma->vm_start = 0;
> > +     vma->vm_end = 0;
> > +     vma->anon_vma = NULL;
> > +     vma->vm_pgoff = 0;
> > +     vma->vm_file = NULL;
> > +     vma->vm_private_data = NULL;
> > +     vm_flags_init(vma, 0);
> > +     memset(&vma->vm_page_prot, 0, sizeof(vma->vm_page_prot));
> > +     memset(&vma->shared, 0, sizeof(vma->shared));
> > +     memset(&vma->vm_userfaultfd_ctx, 0, sizeof(vma->vm_userfaultfd_ctx));
> > +     vma_numab_state_init(vma);
> > +#ifdef CONFIG_ANON_VMA_NAME
> > +     vma->anon_name = NULL;
> > +#endif
> > +#ifdef CONFIG_SWAP
> > +     memset(&vma->swap_readahead_info, 0, sizeof(vma->swap_readahead_info));
> > +#endif
> > +#ifndef CONFIG_MMU
> > +     vma->vm_region = NULL;
> > +#endif
> > +#ifdef CONFIG_NUMA
> > +     vma->vm_policy = NULL;
> > +#endif
>
> This isn't the ideal pattern I think, now that we have a ctor. Ideally the
> ctor would do all this (except setting the vm_mm), and then we need to make
> sure it's also done when freeing the vma, to make sure the freed object is
> in the same state as a new object after the constructor.
>
> On freeing, things like numab_state and anon_name could be NULL'd (by the
> respective destructors) only when they are non-NULL and thus freeing the
> objects pointed to. vm_policy and vm_file could perhaps be handled same way
> after some refactoring (see remove_vma()), vma_dummy_vm_ops are possibly
> already reset by vma_close(), etc.

Ok, let me look some more into it and see if I can improve and
simplify the initialization/freeing logic. Thanks!

>
> > +}
> > +
> > +static void vma_copy(const struct vm_area_struct *src, struct vm_area_struct *dest)
> > +{
> > +     dest->vm_mm = src->vm_mm;
> > +     dest->vm_ops = src->vm_ops;
> > +     dest->vm_start = src->vm_start;
> > +     dest->vm_end = src->vm_end;
> > +     dest->anon_vma = src->anon_vma;
> > +     dest->vm_pgoff = src->vm_pgoff;
> > +     dest->vm_file = src->vm_file;
> > +     dest->vm_private_data = src->vm_private_data;
> > +     vm_flags_init(dest, src->vm_flags);
> > +     memcpy(&dest->vm_page_prot, &src->vm_page_prot,
> > +            sizeof(dest->vm_page_prot));
> > +     memcpy(&dest->shared, &src->shared, sizeof(dest->shared));
> > +     memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
> > +            sizeof(dest->vm_userfaultfd_ctx));
> > +#ifdef CONFIG_ANON_VMA_NAME
> > +     dest->anon_name = src->anon_name;
> > +#endif
> > +#ifdef CONFIG_SWAP
> > +     memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
> > +            sizeof(dest->swap_readahead_info));
> > +#endif
> > +#ifndef CONFIG_MMU
> > +     dest->vm_region = src->vm_region;
> > +#endif
> > +#ifdef CONFIG_NUMA
> > +     dest->vm_policy = src->vm_policy;
> > +#endif
> > +}
> > +
> > +#else /* CONFIG_PER_VMA_LOCK */
Vlastimil Babka Dec. 10, 2024, 4:32 p.m. UTC | #8
On 12/10/24 17:20, Suren Baghdasaryan wrote:
> On Tue, Dec 10, 2024 at 6:21 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 12/6/24 23:52, Suren Baghdasaryan wrote:
>> > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
>> > object reuse before RCU grace period is over will be detected inside
>> > lock_vma_under_rcu().
>> > lock_vma_under_rcu() enters RCU read section, finds the vma at the
>> > given address, locks the vma and checks if it got detached or remapped
>> > to cover a different address range. These last checks are there
>> > to ensure that the vma was not modified after we found it but before
>> > locking it.
>> > vma reuse introduces several new possibilities:
>> > 1. vma can be reused after it was found but before it is locked;
>> > 2. vma can be reused and reinitialized (including changing its vm_mm)
>> > while being locked in vma_start_read();
>> > 3. vma can be reused and reinitialized after it was found but before
>> > it is locked, then attached at a new address or to a new mm while
>> > read-locked;
>> > For case #1 current checks will help detecting cases when:
>> > - vma was reused but not yet added into the tree (detached check)
>> > - vma was reused at a different address range (address check);
>> > We are missing the check for vm_mm to ensure the reused vma was not
>> > attached to a different mm. This patch adds the missing check.
>> > For case #2, we pass mm to vma_start_read() to prevent access to
>> > unstable vma->vm_mm. This might lead to vma_start_read() returning
>> > a false locked result but that's not critical if it's rare because
>> > it will only lead to a retry under mmap_lock.
>> > For case #3, we ensure the order in which vma->detached flag and
>> > vm_start/vm_end/vm_mm are set and checked. vma gets attached after
>> > vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
>> > vma->detached before checking vm_start/vm_end/vm_mm. This is required
>> > because attaching vma happens without vma write-lock, as opposed to
>> > vma detaching, which requires vma write-lock. This patch adds memory
>> > barriers inside is_vma_detached() and vma_mark_attached() needed to
>> > order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
>> > After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
>> > This will facilitate vm_area_struct reuse and will minimize the number
>> > of call_rcu() calls.
>> >
>> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>>
>> I'm wondering about the vma freeing path. Consider vma_complete():
>>
>> vma_mark_detached(vp->remove);
>>   vma->detached = true; - plain write
>> vm_area_free(vp->remove);
>>   vma->vm_lock_seq = UINT_MAX; - plain write
>>   kmem_cache_free(vm_area_cachep)
>> ...
>> potential reallocation
>>
>> against:
>>
>> lock_vma_under_rcu()
>> - mas_walk finds a stale vma due to race
>> vma_start_read()
>>   if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
>>   - can be false, the vma was not being locked on the freeing side?
>>   down_read_trylock(&vma->vm_lock.lock) - suceeds, wasn't locked
>>     this is acquire, but was there any release?
> 
> Yes, there was a release. I think what you missed is that
> vma_mark_detached() that is called from vma_complete() requires VMA to
> be write-locked (see vma_assert_write_locked() in
> vma_mark_detached()). The rule is that a VMA can be attached without
> write-locking but only a write-locked VMA can be detached. So, after

OK but write unlocking means the mm's seqcount is bumped and becomes
non-equal with vma's vma->vm_lock_seq, right?

Yet in the example above we happily set it to UINT_MAX and thus effectively
false unlock it for vma_start_read()?

And this is all done before the vma_complete() side would actually reach
mmap_write_unlock(), AFAICS.

> vma_mark_detached() and before down_read_trylock(&vma->vm_lock.lock)
> in vma_start_read() the VMA write-lock should have been released by
> mmap_write_unlock() and therefore vma->detached=false should be
> visible to the reader when it executed lock_vma_under_rcu().
> 
>>   is_vma_detached() - false negative as the write above didn't propagate
>>     here yet; a read barrier but where is the write barrier?
>>   checks for vma->vm_mm, vm_start, vm_end - nobody reset them yet so false
>>     positive, or they got reset on reallocation but writes didn't propagate
>>
>> Am I missing something that would prevent lock_vma_under_rcu() falsely
>> succeeding here?
>>
Suren Baghdasaryan Dec. 10, 2024, 5:16 p.m. UTC | #9
On Tue, Dec 10, 2024 at 8:32 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/10/24 17:20, Suren Baghdasaryan wrote:
> > On Tue, Dec 10, 2024 at 6:21 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> On 12/6/24 23:52, Suren Baghdasaryan wrote:
> >> > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> >> > object reuse before RCU grace period is over will be detected inside
> >> > lock_vma_under_rcu().
> >> > lock_vma_under_rcu() enters RCU read section, finds the vma at the
> >> > given address, locks the vma and checks if it got detached or remapped
> >> > to cover a different address range. These last checks are there
> >> > to ensure that the vma was not modified after we found it but before
> >> > locking it.
> >> > vma reuse introduces several new possibilities:
> >> > 1. vma can be reused after it was found but before it is locked;
> >> > 2. vma can be reused and reinitialized (including changing its vm_mm)
> >> > while being locked in vma_start_read();
> >> > 3. vma can be reused and reinitialized after it was found but before
> >> > it is locked, then attached at a new address or to a new mm while
> >> > read-locked;
> >> > For case #1 current checks will help detecting cases when:
> >> > - vma was reused but not yet added into the tree (detached check)
> >> > - vma was reused at a different address range (address check);
> >> > We are missing the check for vm_mm to ensure the reused vma was not
> >> > attached to a different mm. This patch adds the missing check.
> >> > For case #2, we pass mm to vma_start_read() to prevent access to
> >> > unstable vma->vm_mm. This might lead to vma_start_read() returning
> >> > a false locked result but that's not critical if it's rare because
> >> > it will only lead to a retry under mmap_lock.
> >> > For case #3, we ensure the order in which vma->detached flag and
> >> > vm_start/vm_end/vm_mm are set and checked. vma gets attached after
> >> > vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
> >> > vma->detached before checking vm_start/vm_end/vm_mm. This is required
> >> > because attaching vma happens without vma write-lock, as opposed to
> >> > vma detaching, which requires vma write-lock. This patch adds memory
> >> > barriers inside is_vma_detached() and vma_mark_attached() needed to
> >> > order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
> >> > After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
> >> > This will facilitate vm_area_struct reuse and will minimize the number
> >> > of call_rcu() calls.
> >> >
> >> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> >>
> >> I'm wondering about the vma freeing path. Consider vma_complete():
> >>
> >> vma_mark_detached(vp->remove);
> >>   vma->detached = true; - plain write
> >> vm_area_free(vp->remove);
> >>   vma->vm_lock_seq = UINT_MAX; - plain write
> >>   kmem_cache_free(vm_area_cachep)
> >> ...
> >> potential reallocation
> >>
> >> against:
> >>
> >> lock_vma_under_rcu()
> >> - mas_walk finds a stale vma due to race
> >> vma_start_read()
> >>   if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
> >>   - can be false, the vma was not being locked on the freeing side?
> >>   down_read_trylock(&vma->vm_lock.lock) - suceeds, wasn't locked
> >>     this is acquire, but was there any release?
> >
> > Yes, there was a release. I think what you missed is that
> > vma_mark_detached() that is called from vma_complete() requires VMA to
> > be write-locked (see vma_assert_write_locked() in
> > vma_mark_detached()). The rule is that a VMA can be attached without
> > write-locking but only a write-locked VMA can be detached. So, after
>
> OK but write unlocking means the mm's seqcount is bumped and becomes
> non-equal with vma's vma->vm_lock_seq, right?
>
> Yet in the example above we happily set it to UINT_MAX and thus effectively
> false unlock it for vma_start_read()?
>
> And this is all done before the vma_complete() side would actually reach
> mmap_write_unlock(), AFAICS.

Ah, you are right. With the possibility of reuse, even a freed VMA
should be kept write-locked until it is unlocked by
mmap_write_unlock(). I think the fix for this is simply to not reset
vma->vm_lock_seq inside vm_area_free(). I'll also need to add a
comment for vm_lock_seq explaining these requirements.
Do you agree that such a change would resolve the issue?
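
I.e. on top of this patch, roughly (untested):

 	if (unreachable)
 		vma->detached = true;
 	else
 		VM_BUG_ON_VMA(!is_vma_detached(vma), vma);
-	vma->vm_lock_seq = UINT_MAX;
+	/*
+	 * Keep vm_lock_seq intact so the vma still looks write-locked to
+	 * vma_start_read() until mmap_write_unlock() bumps mm_lock_seq,
+	 * even if the object is freed before that.
+	 */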

>
> > vma_mark_detached() and before down_read_trylock(&vma->vm_lock.lock)
> > in vma_start_read() the VMA write-lock should have been released by
> > mmap_write_unlock() and therefore vma->detached=false should be
> > visible to the reader when it executed lock_vma_under_rcu().
> >
> >>   is_vma_detached() - false negative as the write above didn't propagate
> >>     here yet; a read barrier but where is the write barrier?
> >>   checks for vma->vm_mm, vm_start, vm_end - nobody reset them yet so false
> >>     positive, or they got reset on reallocation but writes didn't propagate
> >>
> >> Am I missing something that would prevent lock_vma_under_rcu() falsely
> >> succeeding here?
> >>
>
Vlastimil Babka Dec. 10, 2024, 5:25 p.m. UTC | #10
On 12/10/24 18:16, Suren Baghdasaryan wrote:
> On Tue, Dec 10, 2024 at 8:32 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 12/10/24 17:20, Suren Baghdasaryan wrote:
>> > On Tue, Dec 10, 2024 at 6:21 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>> >>
>> >> On 12/6/24 23:52, Suren Baghdasaryan wrote:
>> >> > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
>> >> > object reuse before RCU grace period is over will be detected inside
>> >> > lock_vma_under_rcu().
>> >> > lock_vma_under_rcu() enters RCU read section, finds the vma at the
>> >> > given address, locks the vma and checks if it got detached or remapped
>> >> > to cover a different address range. These last checks are there
>> >> > to ensure that the vma was not modified after we found it but before
>> >> > locking it.
>> >> > vma reuse introduces several new possibilities:
>> >> > 1. vma can be reused after it was found but before it is locked;
>> >> > 2. vma can be reused and reinitialized (including changing its vm_mm)
>> >> > while being locked in vma_start_read();
>> >> > 3. vma can be reused and reinitialized after it was found but before
>> >> > it is locked, then attached at a new address or to a new mm while
>> >> > read-locked;
>> >> > For case #1 current checks will help detecting cases when:
>> >> > - vma was reused but not yet added into the tree (detached check)
>> >> > - vma was reused at a different address range (address check);
>> >> > We are missing the check for vm_mm to ensure the reused vma was not
>> >> > attached to a different mm. This patch adds the missing check.
>> >> > For case #2, we pass mm to vma_start_read() to prevent access to
>> >> > unstable vma->vm_mm. This might lead to vma_start_read() returning
>> >> > a false locked result but that's not critical if it's rare because
>> >> > it will only lead to a retry under mmap_lock.
>> >> > For case #3, we ensure the order in which vma->detached flag and
>> >> > vm_start/vm_end/vm_mm are set and checked. vma gets attached after
>> >> > vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
>> >> > vma->detached before checking vm_start/vm_end/vm_mm. This is required
>> >> > because attaching vma happens without vma write-lock, as opposed to
>> >> > vma detaching, which requires vma write-lock. This patch adds memory
>> >> > barriers inside is_vma_detached() and vma_mark_attached() needed to
>> >> > order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
>> >> > After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
>> >> > This will facilitate vm_area_struct reuse and will minimize the number
>> >> > of call_rcu() calls.
>> >> >
>> >> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>> >>
>> >> I'm wondering about the vma freeing path. Consider vma_complete():
>> >>
>> >> vma_mark_detached(vp->remove);
>> >>   vma->detached = true; - plain write
>> >> vm_area_free(vp->remove);
>> >>   vma->vm_lock_seq = UINT_MAX; - plain write
>> >>   kmem_cache_free(vm_area_cachep)
>> >> ...
>> >> potential reallocation
>> >>
>> >> against:
>> >>
>> >> lock_vma_under_rcu()
>> >> - mas_walk finds a stale vma due to race
>> >> vma_start_read()
>> >>   if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
>> >>   - can be false, the vma was not being locked on the freeing side?
>> >>   down_read_trylock(&vma->vm_lock.lock) - suceeds, wasn't locked
>> >>     this is acquire, but was there any release?
>> >
>> > Yes, there was a release. I think what you missed is that
>> > vma_mark_detached() that is called from vma_complete() requires VMA to
>> > be write-locked (see vma_assert_write_locked() in
>> > vma_mark_detached()). The rule is that a VMA can be attached without
>> > write-locking but only a write-locked VMA can be detached. So, after
>>
>> OK but write unlocking means the mm's seqcount is bumped and becomes
>> non-equal with vma's vma->vm_lock_seq, right?
>>
>> Yet in the example above we happily set it to UINT_MAX and thus effectively
>> false unlock it for vma_start_read()?
>>
>> And this is all done before the vma_complete() side would actually reach
>> mmap_write_unlock(), AFAICS.
> 
> Ah, you are right. With the possibility of reuse, even a freed VMA
> should be kept write-locked until it is unlocked by
> mmap_write_unlock(). I think the fix for this is simply to not reset
> vma->vm_lock_seq inside vm_area_free(). I'll also need to add a

But even if we don't reset vm_lock_seq to UINT_MAX, then whoever reallocated
it can proceed and end up doing a vma_start_write() and rewrite it there
anyway, no?

> comment for vm_lock_seq explaining these requirements.
> Do you agree that such a change would resolve the issue?
> 
>>
>> > vma_mark_detached() and before down_read_trylock(&vma->vm_lock.lock)
>> > in vma_start_read() the VMA write-lock should have been released by
>> > mmap_write_unlock() and therefore vma->detached=false should be
>> > visible to the reader when it executed lock_vma_under_rcu().
>> >
>> >>   is_vma_detached() - false negative as the write above didn't propagate
>> >>     here yet; a read barrier but where is the write barrier?
>> >>   checks for vma->vm_mm, vm_start, vm_end - nobody reset them yet so false
>> >>     positive, or they got reset on reallocation but writes didn't propagate
>> >>
>> >> Am I missing something that would prevent lock_vma_under_rcu() falsely
>> >> succeeding here?
>> >>
>>
Suren Baghdasaryan Dec. 10, 2024, 6:53 p.m. UTC | #11
On Tue, Dec 10, 2024 at 9:25 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/10/24 18:16, Suren Baghdasaryan wrote:
> > On Tue, Dec 10, 2024 at 8:32 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> On 12/10/24 17:20, Suren Baghdasaryan wrote:
> >> > On Tue, Dec 10, 2024 at 6:21 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >> >>
> >> >> On 12/6/24 23:52, Suren Baghdasaryan wrote:
> >> >> > To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that
> >> >> > object reuse before RCU grace period is over will be detected inside
> >> >> > lock_vma_under_rcu().
> >> >> > lock_vma_under_rcu() enters RCU read section, finds the vma at the
> >> >> > given address, locks the vma and checks if it got detached or remapped
> >> >> > to cover a different address range. These last checks are there
> >> >> > to ensure that the vma was not modified after we found it but before
> >> >> > locking it.
> >> >> > vma reuse introduces several new possibilities:
> >> >> > 1. vma can be reused after it was found but before it is locked;
> >> >> > 2. vma can be reused and reinitialized (including changing its vm_mm)
> >> >> > while being locked in vma_start_read();
> >> >> > 3. vma can be reused and reinitialized after it was found but before
> >> >> > it is locked, then attached at a new address or to a new mm while
> >> >> > read-locked;
> >> >> > For case #1 current checks will help detecting cases when:
> >> >> > - vma was reused but not yet added into the tree (detached check)
> >> >> > - vma was reused at a different address range (address check);
> >> >> > We are missing the check for vm_mm to ensure the reused vma was not
> >> >> > attached to a different mm. This patch adds the missing check.
> >> >> > For case #2, we pass mm to vma_start_read() to prevent access to
> >> >> > unstable vma->vm_mm. This might lead to vma_start_read() returning
> >> >> > a false locked result but that's not critical if it's rare because
> >> >> > it will only lead to a retry under mmap_lock.
> >> >> > For case #3, we ensure the order in which vma->detached flag and
> >> >> > vm_start/vm_end/vm_mm are set and checked. vma gets attached after
> >> >> > vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
> >> >> > vma->detached before checking vm_start/vm_end/vm_mm. This is required
> >> >> > because attaching vma happens without vma write-lock, as opposed to
> >> >> > vma detaching, which requires vma write-lock. This patch adds memory
> >> >> > barriers inside is_vma_detached() and vma_mark_attached() needed to
> >> >> > order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
> >> >> > After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
> >> >> > This will facilitate vm_area_struct reuse and will minimize the number
> >> >> > of call_rcu() calls.
> >> >> >
> >> >> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> >> >>
> >> >> I'm wondering about the vma freeing path. Consider vma_complete():
> >> >>
> >> >> vma_mark_detached(vp->remove);
> >> >>   vma->detached = true; - plain write
> >> >> vm_area_free(vp->remove);
> >> >>   vma->vm_lock_seq = UINT_MAX; - plain write
> >> >>   kmem_cache_free(vm_area_cachep)
> >> >> ...
> >> >> potential reallocation
> >> >>
> >> >> against:
> >> >>
> >> >> lock_vma_under_rcu()
> >> >> - mas_walk finds a stale vma due to race
> >> >> vma_start_read()
> >> >>   if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
> >> >>   - can be false, the vma was not being locked on the freeing side?
> >> >>   down_read_trylock(&vma->vm_lock.lock) - suceeds, wasn't locked
> >> >>     this is acquire, but was there any release?
> >> >
> >> > Yes, there was a release. I think what you missed is that
> >> > vma_mark_detached() that is called from vma_complete() requires VMA to
> >> > be write-locked (see vma_assert_write_locked() in
> >> > vma_mark_detached()). The rule is that a VMA can be attached without
> >> > write-locking but only a write-locked VMA can be detached. So, after
> >>
> >> OK but write unlocking means the mm's seqcount is bumped and becomes
> >> non-equal with vma's vma->vm_lock_seq, right?
> >>
> >> Yet in the example above we happily set it to UINT_MAX and thus effectively
> >> false unlock it for vma_start_read()?
> >>
> >> And this is all done before the vma_complete() side would actually reach
> >> mmap_write_unlock(), AFAICS.
> >
> > Ah, you are right. With the possibility of reuse, even a freed VMA
> > should be kept write-locked until it is unlocked by
> > mmap_write_unlock(). I think the fix for this is simply to not reset
> > vma->vm_lock_seq inside vm_area_free(). I'll also need to add a
>
> But even if we don't reset vm_lock_seq to UINT_MAX, then whover reallocated
> it can proceed and end up doing a vma_start_write() and rewrite it there
> anyway, no?

Ugh, yes. It looks like we will need a write memory barrier in
vma_mark_detached() to make it immediately visible.

Case 1:
                                      lock_vma_under_rcu()
vma_complete()
    vma_mark_detached(vp->remove);
        vma->detached = true;
        smp_wmb();
    vm_area_free(vp->remove);
        vma->vm_lock_seq = UINT_MAX;

                                       vma_start_read()
                                          is_vma_detached() <<- abort

       kmem_cache_free(vm_area_cachep);
mmap_write_unlock()


Case 2:
                                      lock_vma_under_rcu()
vma_complete()
    vma_mark_detached(vp->remove);
        vma->detached = true;
        smp_wmb();
    vm_area_free(vp->remove);
        vma->vm_lock_seq = UINT_MAX;

                                       vma_start_read()

        kmem_cache_free(vm_area_cachep);
mmap_write_unlock() // changes mm->mm_lock_seq but does not matter
                    // reader holds vma->vm_lock, so new writers have to wait
    ...
    vm_area_alloc();
    // sets all vma attributes
    vma_mark_attached();
        smp_wmb();

                                           // if is_vma_detached() called
                                           //  before this point, we will
                                           // abort like in Case 1

        vma->detached = false;
                                           is_vma_detached()
                                           // check vm_mm, vm_start, vm_end
                                           // if all checks pass, this is a
                                           // new attached vma and it's
                                           // read-locked (can't be modified)

Did I miss any other race?
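
IOW the detach side would become something like (untested sketch):

static inline void vma_mark_detached(struct vm_area_struct *vma)
{
	vma_assert_write_locked(vma);
	WRITE_ONCE(vma->detached, true);
	/*
	 * Make the detached store visible before the object can be
	 * freed and reused (see Case 1/2 above).
	 */
	smp_wmb();
}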

>
> > comment for vm_lock_seq explaining these requirements.
> > Do you agree that such a change would resolve the issue?
> >
> >>
> >> > vma_mark_detached() and before down_read_trylock(&vma->vm_lock.lock)
> >> > in vma_start_read() the VMA write-lock should have been released by
> >> > mmap_write_unlock() and therefore vma->detached=false should be
> >> > visible to the reader when it executed lock_vma_under_rcu().
> >> >
> >> >>   is_vma_detached() - false negative as the write above didn't propagate
> >> >>     here yet; a read barrier but where is the write barrier?
> >> >>   checks for vma->vm_mm, vm_start, vm_end - nobody reset them yet so false
> >> >>     positive, or they got reset on reallocation but writes didn't propagate
> >> >>
> >> >> Am I missing something that would prevent lock_vma_under_rcu() falsely
> >> >> succeeding here?
> >> >>
> >>
>
Suren Baghdasaryan Dec. 10, 2024, 11:01 p.m. UTC | #12
On Tue, Dec 10, 2024 at 9:25 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/10/24 18:16, Suren Baghdasaryan wrote:
> > On Tue, Dec 10, 2024 at 8:32 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> On 12/10/24 17:20, Suren Baghdasaryan wrote:
> >> > On Tue, Dec 10, 2024 at 6:21 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >> >>
> >> >> On 12/6/24 23:52, Suren Baghdasaryan wrote:
> >> >> > [...]
> >> >>
> >> >> I'm wondering about the vma freeing path. Consider vma_complete():
> >> >>
> >> >> vma_mark_detached(vp->remove);
> >> >>   vma->detached = true; - plain write
> >> >> vm_area_free(vp->remove);
> >> >>   vma->vm_lock_seq = UINT_MAX; - plain write
> >> >>   kmem_cache_free(vm_area_cachep)
> >> >> ...
> >> >> potential reallocation
> >> >>
> >> >> against:
> >> >>
> >> >> lock_vma_under_rcu()
> >> >> - mas_walk finds a stale vma due to race
> >> >> vma_start_read()
> >> >>   if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
> >> >>   - can be false, the vma was not being locked on the freeing side?
> >> >> >>   down_read_trylock(&vma->vm_lock.lock) - succeeds, wasn't locked
> >> >>     this is acquire, but was there any release?
> >> >
> >> > Yes, there was a release. I think what you missed is that
> >> > vma_mark_detached() that is called from vma_complete() requires VMA to
> >> > be write-locked (see vma_assert_write_locked() in
> >> > vma_mark_detached()). The rule is that a VMA can be attached without
> >> > write-locking but only a write-locked VMA can be detached. So, after
> >>
> >> OK but write unlocking means the mm's seqcount is bumped and becomes
> >> non-equal with vma's vma->vm_lock_seq, right?
> >>
> >> Yet in the example above we happily set it to UINT_MAX and thus effectively
> >> false unlock it for vma_start_read()?
> >>
> >> And this is all done before the vma_complete() side would actually reach
> >> mmap_write_unlock(), AFAICS.
> >
> > Ah, you are right. With the possibility of reuse, even a freed VMA
> > should be kept write-locked until it is unlocked by
> > mmap_write_unlock(). I think the fix for this is simply to not reset
> > vma->vm_lock_seq inside vm_area_free(). I'll also need to add a
>
> > But even if we don't reset vm_lock_seq to UINT_MAX, then whoever reallocated
> it can proceed and end up doing a vma_start_write() and rewrite it there
> anyway, no?

Actually, I think with a small change we can simplify these locking rules:

static inline void vma_start_write(struct vm_area_struct *vma)
{
        int mm_lock_seq;

-        if (__is_vma_write_locked(vma, &mm_lock_seq))
-                return;
+        mmap_assert_write_locked(vma->vm_mm);
+        mm_lock_seq = vma->vm_mm->mm_lock_seq;

        down_write(&vma->vm_lock->lock);
        /*
         * We should use WRITE_ONCE() here because we can have concurrent reads
         * from the early lockless pessimistic check in vma_start_read().
         * We don't really care about the correctness of that early check, but
         * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
         */
        WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
        up_write(&vma->vm_lock->lock);
}

This will force vma_start_write() to always write-lock vma->vm_lock
before changing vma->vm_lock_seq. Since vma->vm_lock survives reuse,
the other readers/writers will synchronize on it even if vma got
reused.
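
The "survives reuse" part comes from general slab constructor semantics: the
ctor runs when a slab's objects are first set up, not on every
kmem_cache_alloc(), so ctor-initialized state such as the embedded lock stays
valid across free/alloc reuse. A minimal generic sketch (made-up names, not
from this series):

struct obj {
        spinlock_t lock;        /* initialized once by the ctor, survives reuse */
        int payload;            /* reinitialized by callers on every allocation */
};

static void obj_ctor(void *p)
{
        struct obj *o = p;

        spin_lock_init(&o->lock);
}

/*
 * obj_cachep = kmem_cache_create("obj", sizeof(struct obj), 0,
 *                                SLAB_TYPESAFE_BY_RCU, obj_ctor);
 */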

>
> > comment for vm_lock_seq explaining these requirements.
> > Do you agree that such a change would resolve the issue?
> >
> >>
> >> > vma_mark_detached() and before down_read_trylock(&vma->vm_lock.lock)
> >> > in vma_start_read() the VMA write-lock should have been released by
> >> > mmap_write_unlock() and therefore vma->detached=false should be
> >> > visible to the reader when it executed lock_vma_under_rcu().
> >> >
> >> >>   is_vma_detached() - false negative as the write above didn't propagate
> >> >>     here yet; a read barrier but where is the write barrier?
> >> >>   checks for vma->vm_mm, vm_start, vm_end - nobody reset them yet so false
> >> >>     positive, or they got reset on reallocation but writes didn't propagate
> >> >>
> >> >> Am I missing something that would prevent lock_vma_under_rcu() falsely
> >> >> succeeding here?
> >> >>
> >>
>
Suren Baghdasaryan Dec. 11, 2024, 3:30 p.m. UTC | #13
On Tue, Dec 10, 2024 at 3:01 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Dec 10, 2024 at 9:25 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 12/10/24 18:16, Suren Baghdasaryan wrote:
> > > On Tue, Dec 10, 2024 at 8:32 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >>
> > >> On 12/10/24 17:20, Suren Baghdasaryan wrote:
> > >> > On Tue, Dec 10, 2024 at 6:21 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >> >>
> > >> >> On 12/6/24 23:52, Suren Baghdasaryan wrote:
> > >> >> > [...]
> > >> >>
> > >> >> I'm wondering about the vma freeing path. Consider vma_complete():
> > >> >>
> > >> >> vma_mark_detached(vp->remove);
> > >> >>   vma->detached = true; - plain write
> > >> >> vm_area_free(vp->remove);
> > >> >>   vma->vm_lock_seq = UINT_MAX; - plain write
> > >> >>   kmem_cache_free(vm_area_cachep)
> > >> >> ...
> > >> >> potential reallocation
> > >> >>
> > >> >> against:
> > >> >>
> > >> >> lock_vma_under_rcu()
> > >> >> - mas_walk finds a stale vma due to race
> > >> >> vma_start_read()
> > >> >>   if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
> > >> >>   - can be false, the vma was not being locked on the freeing side?
> > >> >> >>   down_read_trylock(&vma->vm_lock.lock) - succeeds, wasn't locked
> > >> >>     this is acquire, but was there any release?
> > >> >
> > >> > Yes, there was a release. I think what you missed is that
> > >> > vma_mark_detached() that is called from vma_complete() requires VMA to
> > >> > be write-locked (see vma_assert_write_locked() in
> > >> > vma_mark_detached()). The rule is that a VMA can be attached without
> > >> > write-locking but only a write-locked VMA can be detached. So, after
> > >>
> > >> OK but write unlocking means the mm's seqcount is bumped and becomes
> > >> non-equal with vma's vma->vm_lock_seq, right?
> > >>
> > >> Yet in the example above we happily set it to UINT_MAX and thus effectively
> > >> false unlock it for vma_start_read()?
> > >>
> > >> And this is all done before the vma_complete() side would actually reach
> > >> mmap_write_unlock(), AFAICS.
> > >
> > > Ah, you are right. With the possibility of reuse, even a freed VMA
> > > should be kept write-locked until it is unlocked by
> > > mmap_write_unlock(). I think the fix for this is simply to not reset
> > > vma->vm_lock_seq inside vm_area_free(). I'll also need to add a
> >
> > > But even if we don't reset vm_lock_seq to UINT_MAX, then whoever reallocated
> > it can proceed and end up doing a vma_start_write() and rewrite it there
> > anyway, no?
>
> Actually, I think with a small change we can simplify these locking rules:
>
> static inline void vma_start_write(struct vm_area_struct *vma)
> {
>         int mm_lock_seq;
>
> -        if (__is_vma_write_locked(vma, &mm_lock_seq))
> -                return;
> +        mmap_assert_write_locked(vma->vm_mm);
> +        mm_lock_seq = vma->vm_mm->mm_lock_seq;
>
>         down_write(&vma->vm_lock->lock);
>         /*
>         * We should use WRITE_ONCE() here because we can have concurrent reads
>         * from the early lockless pessimistic check in vma_start_read().
>         * We don't really care about the correctness of that early check, but
>         * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
>         */
>         WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
>         up_write(&vma->vm_lock->lock);
> }
>
> This will force vma_start_write() to always write-lock vma->vm_lock
> before changing vma->vm_lock_seq. Since vma->vm_lock survives reuse,
> the other readers/writers will synchronize on it even if vma got
> reused.

After thinking of all the alternatives, I think the cleanest way to
handle vma detaching would be to follow the same pattern as for vma
attaching. To attach a vma we do:

vma->vm_mm = xxx;
...
vma_mark_attached()
    smp_wmb();
    WRITE_ONCE(vma->detached, false);


lock_vma_under_rcu() ensures that a vma is attached and still
unchanged like this:

lock_vma_under_rcu()
    vma_start_read();
    is_vma_detached()
        detached = READ_ONCE(vma->detached);
        smp_rmb();
    if (vma->vm_mm != mm)

So, vm_area_free() can follow the same pattern to ensure vma reuse
gets detected even if lock_vma_under_rcu() succeeds in locking the
vma:

vm_area_free()
    vma->vm_mm = NULL;
    smp_wmb();
    WRITE_ONCE(vma->detached, true);

Vlastimil, I think that should address the race you described. WDYT?
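
Spelled out against the current free path, a minimal sketch of that proposal
(the tail is the existing __vm_area_free() body from this patch; whether the
ordering is sufficient is discussed in the follow-up below):

void vm_area_free(struct vm_area_struct *vma)
{
        /* Invalidate the field lock_vma_under_rcu() checks... */
        vma->vm_mm = NULL;
        /* ...and order that store before marking the vma detached. */
        smp_wmb();
        WRITE_ONCE(vma->detached, true);

        vma_numab_state_free(vma);
        free_anon_vma_name(vma);
        kmem_cache_free(vm_area_cachep, vma);
}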

>
> >
> > > comment for vm_lock_seq explaining these requirements.
> > > Do you agree that such a change would resolve the issue?
> > >
> > >>
> > >> > vma_mark_detached() and before down_read_trylock(&vma->vm_lock.lock)
> > >> > in vma_start_read() the VMA write-lock should have been released by
> > >> > mmap_write_unlock() and therefore vma->detached=false should be
> > >> > visible to the reader when it executed lock_vma_under_rcu().
> > >> >
> > >> >>   is_vma_detached() - false negative as the write above didn't propagate
> > >> >>     here yet; a read barrier but where is the write barrier?
> > >> >>   checks for vma->vm_mm, vm_start, vm_end - nobody reset them yet so false
> > >> >>     positive, or they got reset on reallocation but writes didn't propagate
> > >> >>
> > >> >> Am I missing something that would prevent lock_vma_under_rcu() falsely
> > >> >> succeeding here?
> > >> >>
> > >>
> >
Vlastimil Babka Dec. 11, 2024, 4:05 p.m. UTC | #14
On 12/11/24 16:30, Suren Baghdasaryan wrote:
> On Tue, Dec 10, 2024 at 3:01 PM Suren Baghdasaryan <surenb@google.com> wrote:
>>
>> On Tue, Dec 10, 2024 at 9:25 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>> >
>> > On 12/10/24 18:16, Suren Baghdasaryan wrote:
>> > > On Tue, Dec 10, 2024 at 8:32 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>> > >>
>> > >> On 12/10/24 17:20, Suren Baghdasaryan wrote:
>> > >> > On Tue, Dec 10, 2024 at 6:21 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>> > >> >>
>> > >> >> On 12/6/24 23:52, Suren Baghdasaryan wrote:
>> > >> >> > [...]
>> > >> >>
>> > >> >> I'm wondering about the vma freeing path. Consider vma_complete():
>> > >> >>
>> > >> >> vma_mark_detached(vp->remove);
>> > >> >>   vma->detached = true; - plain write
>> > >> >> vm_area_free(vp->remove);
>> > >> >>   vma->vm_lock_seq = UINT_MAX; - plain write
>> > >> >>   kmem_cache_free(vm_area_cachep)
>> > >> >> ...
>> > >> >> potential reallocation
>> > >> >>
>> > >> >> against:
>> > >> >>
>> > >> >> lock_vma_under_rcu()
>> > >> >> - mas_walk finds a stale vma due to race
>> > >> >> vma_start_read()
>> > >> >>   if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
>> > >> >>   - can be false, the vma was not being locked on the freeing side?
>> > >> >> >>   down_read_trylock(&vma->vm_lock.lock) - succeeds, wasn't locked
>> > >> >>     this is acquire, but was there any release?
>> > >> >
>> > >> > Yes, there was a release. I think what you missed is that
>> > >> > vma_mark_detached() that is called from vma_complete() requires VMA to
>> > >> > be write-locked (see vma_assert_write_locked() in
>> > >> > vma_mark_detached()). The rule is that a VMA can be attached without
>> > >> > write-locking but only a write-locked VMA can be detached. So, after
>> > >>
>> > >> OK but write unlocking means the mm's seqcount is bumped and becomes
>> > >> non-equal with vma's vma->vm_lock_seq, right?
>> > >>
>> > >> Yet in the example above we happily set it to UINT_MAX and thus effectively
>> > >> false unlock it for vma_start_read()?
>> > >>
>> > >> And this is all done before the vma_complete() side would actually reach
>> > >> mmap_write_unlock(), AFAICS.
>> > >
>> > > Ah, you are right. With the possibility of reuse, even a freed VMA
>> > > should be kept write-locked until it is unlocked by
>> > > mmap_write_unlock(). I think the fix for this is simply to not reset
>> > > vma->vm_lock_seq inside vm_area_free(). I'll also need to add a
>> >
> >> > But even if we don't reset vm_lock_seq to UINT_MAX, then whoever reallocated
>> > it can proceed and end up doing a vma_start_write() and rewrite it there
>> > anyway, no?
>>
>> Actually, I think with a small change we can simplify these locking rules:
>>
>> static inline void vma_start_write(struct vm_area_struct *vma)
>> {
>>         int mm_lock_seq;
>>
>> -        if (__is_vma_write_locked(vma, &mm_lock_seq))
>> -                return;
>> +        mmap_assert_write_locked(vma->vm_mm);
>> +        mm_lock_seq = vma->vm_mm->mm_lock_seq;
>>
>>         down_write(&vma->vm_lock->lock);
>>         /*
>>         * We should use WRITE_ONCE() here because we can have concurrent reads
>>         * from the early lockless pessimistic check in vma_start_read().
>>         * We don't really care about the correctness of that early check, but
>>         * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
>>         */
>>         WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
>>         up_write(&vma->vm_lock->lock);
>> }
>>
>> This will force vma_start_write() to always write-lock vma->vm_lock
>> before changing vma->vm_lock_seq. Since vma->vm_lock survives reuse,
>> the other readers/writers will synchronize on it even if vma got
>> reused.
> 
> After thinking of all the alternatives, I think the cleanest way to
> handle vma detaching would be to follow the same pattern as for vma
> attaching. To attach a vma we do:
> 
> vma->vm_mm = xxx;
> ...
> vma_mark_attached()
>     smp_wmb();
>     WRITE_ONCE(vma->detached, false);
> 
> 
> lock_vma_under_rcu() ensures that a vma is attached and still
> unchanged like this:
> 
> lock_vma_under_rcu()
>     vma_start_read();
>     is_vma_detached()
>         detached = READ_ONCE(vma->detached);
>         smp_rmb();
>     if (vma->vm_mm != mm)
> 
> So, vm_area_free() can follow the same pattern to ensure vma reuse
> gets detected even if lock_vma_under_rcu() succeeds in locking the
> vma:
> 
> vm_area_free()
>     vma->vm_mm = NULL;
>     smp_wmb();
>     WRITE_ONCE(vma->detached, true);
> 
> Vlastimil, I think that should address the race you described. WDYT?

I'm not sure. AFAIU the barriers would ensure that if lock_vma_under_rcu()
sees the vma as detached, it also sees vm_mm as NULL. But since they don't
ensure that it will see the vma as detached, they also don't ensure we will
see vm_mm as NULL.

I think the main problem is that we unlock the vma by writing to a mm, not
the vma, which makes it hard to apply the necessary SLAB_TYPESAFE_BY_RCU
validation patterns to it. I thought the direction you were discussing with
PeterZ in the other thread would solve this (in addition to getting rid of
the rwsem, which we were considering anyway, but enabling
SLAB_TYPESAFE_BY_RCU that way would be a stronger argument for it).

Perhaps a solution to this that would work with the current rwsem would be
to set detached and vm_mm to NULL under the down_write() of the rwsem. That
would make sure that if lock_vma_under_rcu() succeeds with the
down_read_trylock(), it would be guaranteed to see those assignments?
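
As code, that suggestion might look roughly like the following (illustrative
only; the helper name is made up and nothing like it is in the posted patch):

static void vma_mark_detached_for_reuse(struct vm_area_struct *vma)
{
        /*
         * Readers take this rwsem in vma_start_read(), so a reader that
         * wins down_read_trylock() after this point also observes the
         * stores below.
         */
        down_write(&vma->vm_lock.lock);
        vma->vm_mm = NULL;
        WRITE_ONCE(vma->detached, true);
        up_write(&vma->vm_lock.lock);
}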

>>
>> >
>> > > comment for vm_lock_seq explaining these requirements.
>> > > Do you agree that such a change would resolve the issue?
>> > >
>> > >>
>> > >> > vma_mark_detached() and before down_read_trylock(&vma->vm_lock.lock)
>> > >> > in vma_start_read() the VMA write-lock should have been released by
>> > >> > mmap_write_unlock() and therefore vma->detached=false should be
>> > >> > visible to the reader when it executed lock_vma_under_rcu().
>> > >> >
>> > >> >>   is_vma_detached() - false negative as the write above didn't propagate
>> > >> >>     here yet; a read barrier but where is the write barrier?
>> > >> >>   checks for vma->vm_mm, vm_start, vm_end - nobody reset them yet so false
>> > >> >>     positive, or they got reset on reallocation but writes didn't propagate
>> > >> >>
>> > >> >> Am I missing something that would prevent lock_vma_under_rcu() falsely
>> > >> >> succeeding here?
>> > >> >>
>> > >>
>> >
Suren Baghdasaryan Dec. 11, 2024, 4:14 p.m. UTC | #15
On Wed, Dec 11, 2024 at 8:05 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/11/24 16:30, Suren Baghdasaryan wrote:
> > On Tue, Dec 10, 2024 at 3:01 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >>
> >> On Tue, Dec 10, 2024 at 9:25 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >> >
> >> > On 12/10/24 18:16, Suren Baghdasaryan wrote:
> >> > > On Tue, Dec 10, 2024 at 8:32 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >> > >>
> >> > >> On 12/10/24 17:20, Suren Baghdasaryan wrote:
> >> > >> > On Tue, Dec 10, 2024 at 6:21 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >> > >> >>
> >> > >> >> On 12/6/24 23:52, Suren Baghdasaryan wrote:
> >> > >> >> > [...]
> >> > >> >>
> >> > >> >> I'm wondering about the vma freeing path. Consider vma_complete():
> >> > >> >>
> >> > >> >> vma_mark_detached(vp->remove);
> >> > >> >>   vma->detached = true; - plain write
> >> > >> >> vm_area_free(vp->remove);
> >> > >> >>   vma->vm_lock_seq = UINT_MAX; - plain write
> >> > >> >>   kmem_cache_free(vm_area_cachep)
> >> > >> >> ...
> >> > >> >> potential reallocation
> >> > >> >>
> >> > >> >> against:
> >> > >> >>
> >> > >> >> lock_vma_under_rcu()
> >> > >> >> - mas_walk finds a stale vma due to race
> >> > >> >> vma_start_read()
> >> > >> >>   if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
> >> > >> >>   - can be false, the vma was not being locked on the freeing side?
> >> > >> >> >>   down_read_trylock(&vma->vm_lock.lock) - succeeds, wasn't locked
> >> > >> >>     this is acquire, but was there any release?
> >> > >> >
> >> > >> > Yes, there was a release. I think what you missed is that
> >> > >> > vma_mark_detached() that is called from vma_complete() requires VMA to
> >> > >> > be write-locked (see vma_assert_write_locked() in
> >> > >> > vma_mark_detached()). The rule is that a VMA can be attached without
> >> > >> > write-locking but only a write-locked VMA can be detached. So, after
> >> > >>
> >> > >> OK but write unlocking means the mm's seqcount is bumped and becomes
> >> > >> non-equal with vma's vma->vm_lock_seq, right?
> >> > >>
> >> > >> Yet in the example above we happily set it to UINT_MAX and thus effectively
> >> > >> false unlock it for vma_start_read()?
> >> > >>
> >> > >> And this is all done before the vma_complete() side would actually reach
> >> > >> mmap_write_unlock(), AFAICS.
> >> > >
> >> > > Ah, you are right. With the possibility of reuse, even a freed VMA
> >> > > should be kept write-locked until it is unlocked by
> >> > > mmap_write_unlock(). I think the fix for this is simply to not reset
> >> > > vma->vm_lock_seq inside vm_area_free(). I'll also need to add a
> >> >
> >> > But even if we don't reset vm_lock_seq to UINT_MAX, then whoever reallocated
> >> > it can proceed and end up doing a vma_start_write() and rewrite it there
> >> > anyway, no?
> >>
> >> Actually, I think with a small change we can simplify these locking rules:
> >>
> >> static inline void vma_start_write(struct vm_area_struct *vma)
> >> {
> >>         int mm_lock_seq;
> >>
> >> -        if (__is_vma_write_locked(vma, &mm_lock_seq))
> >> -                return;
> >> +        mmap_assert_write_locked(vma->vm_mm);
> >> +        mm_lock_seq = vma->vm_mm->mm_lock_seq;
> >>
> >>         down_write(&vma->vm_lock->lock);
> >>         /*
> >>         * We should use WRITE_ONCE() here because we can have concurrent reads
> >>         * from the early lockless pessimistic check in vma_start_read().
> >>         * We don't really care about the correctness of that early check, but
> >>         * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
> >>         */
> >>         WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> >>         up_write(&vma->vm_lock->lock);
> >> }
> >>
> >> This will force vma_start_write() to always write-lock vma->vm_lock
> >> before changing vma->vm_lock_seq. Since vma->vm_lock survives reuse,
> >> the other readers/writers will synchronize on it even if vma got
> >> reused.
> >
> > After thinking of all the alternatives, I think the cleanest way to
> > handle vma detaching would be to follow the same pattern as for vma
> > attaching. To attach a vma we do:
> >
> > vma->vm_mm = xxx;
> > ...
> > vma_mark_attached()
> >     smp_wmb();
> >     WRITE_ONCE(vma->detached, false);
> >
> >
> > lock_vma_under_rcu() ensures that a vma is attached and still
> > unchanged like this:
> >
> > lock_vma_under_rcu()
> >     vma_start_read();
> >     is_vma_detached()
> >         detached = READ_ONCE(vma->detached);
> >         smp_rmb();
> >     if (vma->vm_mm != mm)
> >
> > So, vm_area_free() can follow the same pattern to ensure vma reuse
> > gets detected even if lock_vma_under_rcu() succeeds in locking the
> > vma:
> >
> > vm_area_free()
> >     vma->vm_mm = NULL;
> >     smp_wmb();
> >     WRITE_ONCE(vma->detached, true);
> >
> > Vlastimil, I think that should address the race you described. WDYT?
>
> I'm not sure. AFAIU the barriers would ensure that if lock_vma_under_rcu()
> sees the vma as detached, it also sees vm_mm as NULL. But since they don't
> ensure that it will see the vma as detached, they also don't ensure we will
> see vm_mm as NULL.
>
> I think the main problem is that we unlock the vma by writing to a mm, not
> the vma, which makes it hard to apply the necessary SLAB_TYPESAFE_BY_RCU
> validation patterns to it. I thought the direction you were discussing with
> PeterZ in the other thread would solve this (in addition to getting rid of
> the rwsem, which we were considering anyway, but enabling
> SLAB_TYPESAFE_BY_RCU that way would be a stronger argument for it).

I was hoping to implement SLAB_TYPESAFE_BY_RCU independently from the
vm_lock change, but you are probably right. Incorporating the vma->detached
flag into the lock itself (which survives reuse) would make things way
easier. Let me pivot towards making that change first and see if
SLAB_TYPESAFE_BY_RCU becomes simpler.
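
One possible shape of that direction, purely as a hypothetical illustration
(names are made up; nothing like this is in the posted series): fold the
attached/detached state into a counter that, like vm_lock, lives in the
ctor-initialized part of the vma and therefore survives reuse:

/* Hypothetical field vma->vm_refcnt: 0 means detached, non-zero means attached. */
static inline bool vma_read_tryget(struct vm_area_struct *vma)
{
        /* Fails for a detached (or just-recycled, not yet attached) vma. */
        return refcount_inc_not_zero(&vma->vm_refcnt);
}

static inline void vma_read_put(struct vm_area_struct *vma)
{
        refcount_dec(&vma->vm_refcnt);
}

Detaching would then be the writer-only transition back to zero, to which the
existing "detach requires the vma write lock" rule would still apply.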

>
> Perhaps a solution to this that would work with the current rwsem would be
> to set detached and vm_mm to NULL under the down_write() of the rwsem. That
> would make sure that if lock_vma_under_rcu() succeeds with the
> down_read_trylock(), it would be guaranteed to see those assignments?

Yeah, that would definitely work. I was trying to avoid extra locking
but it looks like it's unavoidable.
Anyway, let me try replacing vm_lock first and see where we end up.
Thanks for the input!

>
> >>
> >> >
> >> > > comment for vm_lock_seq explaining these requirements.
> >> > > Do you agree that such a change would resolve the issue?
> >> > >
> >> > >>
> >> > >> > vma_mark_detached() and before down_read_trylock(&vma->vm_lock.lock)
> >> > >> > in vma_start_read() the VMA write-lock should have been released by
> >> > >> > mmap_write_unlock() and therefore vma->detached=false should be
> >> > >> > visible to the reader when it executed lock_vma_under_rcu().
> >> > >> >
> >> > >> >>   is_vma_detached() - false negative as the write above didn't propagate
> >> > >> >>     here yet; a read barrier but where is the write barrier?
> >> > >> >>   checks for vma->vm_mm, vm_start, vm_end - nobody reset them yet so false
> >> > >> >>     positive, or they got reset on reallocation but writes didn't propagate
> >> > >> >>
> >> > >> >> Am I missing something that would prevent lock_vma_under_rcu() falsely
> >> > >> >> succeeding here?
> >> > >> >>
> >> > >>
> >> >
>
diff mbox series

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2bf38c1e9cca..3568bcbc7c81 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -257,7 +257,7 @@  struct vm_area_struct *vm_area_alloc(struct mm_struct *);
 struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
 void vm_area_free(struct vm_area_struct *);
 /* Use only if VMA has no other users */
-void __vm_area_free(struct vm_area_struct *vma);
+void vm_area_free_unreachable(struct vm_area_struct *vma);
 
 #ifndef CONFIG_MMU
 extern struct rb_root nommu_region_tree;
@@ -706,8 +706,10 @@  static inline void vma_lock_init(struct vm_area_struct *vma)
  * Try to read-lock a vma. The function is allowed to occasionally yield false
  * locked result to avoid performance overhead, in which case we fall back to
  * using mmap_lock. The function should never yield false unlocked result.
+ * False locked result is possible if mm_lock_seq overflows or if vma gets
+ * reused and attached to a different mm before we lock it.
  */
-static inline bool vma_start_read(struct vm_area_struct *vma)
+static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
 {
 	/*
 	 * Check before locking. A race might cause false locked result.
@@ -716,7 +718,7 @@  static inline bool vma_start_read(struct vm_area_struct *vma)
 	 * we don't rely on for anything - the mm_lock_seq read against which we
 	 * need ordering is below.
 	 */
-	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
+	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
 		return false;
 
 	if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
@@ -733,7 +735,7 @@  static inline bool vma_start_read(struct vm_area_struct *vma)
 	 * after it has been unlocked.
 	 * This pairs with RELEASE semantics in vma_end_write_all().
 	 */
-	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
+	if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
 		up_read(&vma->vm_lock.lock);
 		return false;
 	}
@@ -822,7 +824,15 @@  static inline void vma_assert_locked(struct vm_area_struct *vma)
 
 static inline void vma_mark_attached(struct vm_area_struct *vma)
 {
-	vma->detached = false;
+	/*
+	 * This pairs with smp_rmb() inside is_vma_detached().
+	 * vma is marked attached after all vma modifications are done and it
+	 * got added into the vma tree. All prior vma modifications should be
+	 * made visible before marking the vma attached.
+	 */
+	smp_wmb();
+	/* This pairs with READ_ONCE() in is_vma_detached(). */
+	WRITE_ONCE(vma->detached, false);
 }
 
 static inline void vma_mark_detached(struct vm_area_struct *vma)
@@ -834,7 +844,18 @@  static inline void vma_mark_detached(struct vm_area_struct *vma)
 
 static inline bool is_vma_detached(struct vm_area_struct *vma)
 {
-	return vma->detached;
+	bool detached;
+
+	/* This pairs with WRITE_ONCE() in vma_mark_attached(). */
+	detached = READ_ONCE(vma->detached);
+	/*
+	 * This pairs with smp_wmb() inside vma_mark_attached() to ensure
+	 * vma->detached is read before vma attributes read later inside
+	 * lock_vma_under_rcu().
+	 */
+	smp_rmb();
+
+	return detached;
 }
 
 static inline void release_fault_lock(struct vm_fault *vmf)
@@ -859,7 +880,7 @@  struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 #else /* CONFIG_PER_VMA_LOCK */
 
 static inline void vma_lock_init(struct vm_area_struct *vma) {}
-static inline bool vma_start_read(struct vm_area_struct *vma)
+static inline bool vma_start_read(struct mm_struct *mm, struct vm_area_struct *vma)
 		{ return false; }
 static inline void vma_end_read(struct vm_area_struct *vma) {}
 static inline void vma_start_write(struct vm_area_struct *vma) {}
@@ -893,6 +914,7 @@  static inline void assert_fault_locked(struct vm_fault *vmf)
 
 extern const struct vm_operations_struct vma_dummy_vm_ops;
 
+/* Use on VMAs not created using vm_area_alloc() */
 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 {
 	memset(vma, 0, sizeof(*vma));
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index be3551654325..5d8779997266 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -543,6 +543,12 @@  static inline void *folio_get_private(struct folio *folio)
 
 typedef unsigned long vm_flags_t;
 
+/*
+ * freeptr_t represents a SLUB freelist pointer, which might be encoded
+ * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
+ */
+typedef struct { unsigned long v; } freeptr_t;
+
 /*
  * A region containing a mapping of a non-memory backed file under NOMMU
  * conditions.  These are held in a global tree and are pinned by the VMAs that
@@ -657,9 +663,7 @@  struct vm_area_struct {
 			unsigned long vm_start;
 			unsigned long vm_end;
 		};
-#ifdef CONFIG_PER_VMA_LOCK
-		struct rcu_head vm_rcu;	/* Used for deferred freeing. */
-#endif
+		freeptr_t vm_freeptr; /* Pointer used by SLAB_TYPESAFE_BY_RCU */
 	};
 
 	/*
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 10a971c2bde3..681b685b6c4e 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -234,12 +234,6 @@  enum _slab_flag_bits {
 #define SLAB_NO_OBJ_EXT		__SLAB_FLAG_UNUSED
 #endif
 
-/*
- * freeptr_t represents a SLUB freelist pointer, which might be encoded
- * and not dereferenceable if CONFIG_SLAB_FREELIST_HARDENED is enabled.
- */
-typedef struct { unsigned long v; } freeptr_t;
-
 /*
  * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
  *
diff --git a/kernel/fork.c b/kernel/fork.c
index 71990f46aa4e..e7e76a660e4c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -436,6 +436,98 @@  static struct kmem_cache *vm_area_cachep;
 /* SLAB cache for mm_struct structures (tsk->mm) */
 static struct kmem_cache *mm_cachep;
 
+static void vm_area_ctor(void *data)
+{
+	struct vm_area_struct *vma = (struct vm_area_struct *)data;
+
+#ifdef CONFIG_PER_VMA_LOCK
+	/* vma is not locked, can't use vma_mark_detached() */
+	vma->detached = true;
+#endif
+	INIT_LIST_HEAD(&vma->anon_vma_chain);
+	vma_lock_init(vma);
+}
+
+#ifdef CONFIG_PER_VMA_LOCK
+
+static void vma_clear(struct vm_area_struct *vma, struct mm_struct *mm)
+{
+	vma->vm_mm = mm;
+	vma->vm_ops = &vma_dummy_vm_ops;
+	vma->vm_start = 0;
+	vma->vm_end = 0;
+	vma->anon_vma = NULL;
+	vma->vm_pgoff = 0;
+	vma->vm_file = NULL;
+	vma->vm_private_data = NULL;
+	vm_flags_init(vma, 0);
+	memset(&vma->vm_page_prot, 0, sizeof(vma->vm_page_prot));
+	memset(&vma->shared, 0, sizeof(vma->shared));
+	memset(&vma->vm_userfaultfd_ctx, 0, sizeof(vma->vm_userfaultfd_ctx));
+	vma_numab_state_init(vma);
+#ifdef CONFIG_ANON_VMA_NAME
+	vma->anon_name = NULL;
+#endif
+#ifdef CONFIG_SWAP
+	memset(&vma->swap_readahead_info, 0, sizeof(vma->swap_readahead_info));
+#endif
+#ifndef CONFIG_MMU
+	vma->vm_region = NULL;
+#endif
+#ifdef CONFIG_NUMA
+	vma->vm_policy = NULL;
+#endif
+}
+
+static void vma_copy(const struct vm_area_struct *src, struct vm_area_struct *dest)
+{
+	dest->vm_mm = src->vm_mm;
+	dest->vm_ops = src->vm_ops;
+	dest->vm_start = src->vm_start;
+	dest->vm_end = src->vm_end;
+	dest->anon_vma = src->anon_vma;
+	dest->vm_pgoff = src->vm_pgoff;
+	dest->vm_file = src->vm_file;
+	dest->vm_private_data = src->vm_private_data;
+	vm_flags_init(dest, src->vm_flags);
+	memcpy(&dest->vm_page_prot, &src->vm_page_prot,
+	       sizeof(dest->vm_page_prot));
+	memcpy(&dest->shared, &src->shared, sizeof(dest->shared));
+	memcpy(&dest->vm_userfaultfd_ctx, &src->vm_userfaultfd_ctx,
+	       sizeof(dest->vm_userfaultfd_ctx));
+#ifdef CONFIG_ANON_VMA_NAME
+	dest->anon_name = src->anon_name;
+#endif
+#ifdef CONFIG_SWAP
+	memcpy(&dest->swap_readahead_info, &src->swap_readahead_info,
+	       sizeof(dest->swap_readahead_info));
+#endif
+#ifndef CONFIG_MMU
+	dest->vm_region = src->vm_region;
+#endif
+#ifdef CONFIG_NUMA
+	dest->vm_policy = src->vm_policy;
+#endif
+}
+
+#else /* CONFIG_PER_VMA_LOCK */
+
+static void vma_clear(struct vm_area_struct *vma, struct mm_struct *mm)
+{
+	vma_init(vma, mm);
+}
+
+static void vma_copy(const struct vm_area_struct *src, struct vm_area_struct *dest)
+{
+	/*
+	 * orig->shared.rb may be modified concurrently, but the clone
+	 * will be reinitialized.
+	 */
+	data_race(memcpy(dest, src, sizeof(*dest)));
+}
+
+#endif /* CONFIG_PER_VMA_LOCK */
+
 struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
@@ -444,7 +536,7 @@  struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
 	if (!vma)
 		return NULL;
 
-	vma_init(vma, mm);
+	vma_clear(vma, mm);
 
 	return vma;
 }
@@ -458,49 +550,46 @@  struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 
 	ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
 	ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
-	/*
-	 * orig->shared.rb may be modified concurrently, but the clone
-	 * will be reinitialized.
-	 */
-	data_race(memcpy(new, orig, sizeof(*new)));
-	vma_lock_init(new);
-	INIT_LIST_HEAD(&new->anon_vma_chain);
-#ifdef CONFIG_PER_VMA_LOCK
-	/* vma is not locked, can't use vma_mark_detached() */
-	new->detached = true;
-#endif
+	vma_copy(orig, new);
 	vma_numab_state_init(new);
 	dup_anon_vma_name(orig, new);
 
 	return new;
 }
 
-void __vm_area_free(struct vm_area_struct *vma)
+static void __vm_area_free(struct vm_area_struct *vma, bool unreachable)
 {
+#ifdef CONFIG_PER_VMA_LOCK
+	/*
+	 * With SLAB_TYPESAFE_BY_RCU, vma can be reused and we need
+	 * vma->detached to be set before vma is returned into the cache.
+	 * This way reused object won't be used by readers until it's
+	 * initialized and reattached.
+	 * If vma is unreachable, there can be no other users and we
+	 * can set vma->detached directly with no risk of a race.
+	 * If vma is reachable, then it should have been already detached
+	 * under vma write-lock or it was never attached.
+	 */
+	if (unreachable)
+		vma->detached = true;
+	else
+		VM_BUG_ON_VMA(!is_vma_detached(vma), vma);
+	vma->vm_lock_seq = UINT_MAX;
+#endif
+	VM_BUG_ON_VMA(!list_empty(&vma->anon_vma_chain), vma);
 	vma_numab_state_free(vma);
 	free_anon_vma_name(vma);
 	kmem_cache_free(vm_area_cachep, vma);
 }
 
-#ifdef CONFIG_PER_VMA_LOCK
-static void vm_area_free_rcu_cb(struct rcu_head *head)
+void vm_area_free(struct vm_area_struct *vma)
 {
-	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
-						  vm_rcu);
-
-	/* The vma should not be locked while being destroyed. */
-	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
-	__vm_area_free(vma);
+	__vm_area_free(vma, false);
 }
-#endif
 
-void vm_area_free(struct vm_area_struct *vma)
+void vm_area_free_unreachable(struct vm_area_struct *vma)
 {
-#ifdef CONFIG_PER_VMA_LOCK
-	call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
-#else
-	__vm_area_free(vma);
-#endif
+	__vm_area_free(vma, true);
 }
 
 static void account_kernel_stack(struct task_struct *tsk, int account)
@@ -3141,6 +3230,12 @@  void __init mm_cache_init(void)
 
 void __init proc_caches_init(void)
 {
+	struct kmem_cache_args args = {
+		.use_freeptr_offset = true,
+		.freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr),
+		.ctor = vm_area_ctor,
+	};
+
 	sighand_cachep = kmem_cache_create("sighand_cache",
 			sizeof(struct sighand_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
@@ -3157,9 +3252,11 @@  void __init proc_caches_init(void)
 			sizeof(struct fs_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
 			NULL);
-	vm_area_cachep = KMEM_CACHE(vm_area_struct,
-			SLAB_HWCACHE_ALIGN|SLAB_NO_MERGE|SLAB_PANIC|
+	vm_area_cachep = kmem_cache_create("vm_area_struct",
+			sizeof(struct vm_area_struct), &args,
+			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
 			SLAB_ACCOUNT);
+
 	mmap_init();
 	nsproxy_cache_init();
 }
diff --git a/mm/memory.c b/mm/memory.c
index b252f19b28c9..6f4d4d423835 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6368,10 +6368,16 @@  struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	if (!vma)
 		goto inval;
 
-	if (!vma_start_read(vma))
+	if (!vma_start_read(mm, vma))
 		goto inval;
 
-	/* Check if the VMA got isolated after we found it */
+	/*
+	 * Check if the VMA got isolated after we found it.
+	 * Note: vma we found could have been recycled and is being reattached.
+	 * It's possible to attach a vma while it is read-locked, however a
+	 * read-locked vma can't be detached (detaching requires write-locking).
+	 * Therefore if this check passes, we have an attached and stable vma.
+	 */
 	if (is_vma_detached(vma)) {
 		vma_end_read(vma);
 		count_vm_vma_lock_event(VMA_LOCK_MISS);
@@ -6385,8 +6391,9 @@  struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	 * fields are accessible for RCU readers.
 	 */
 
-	/* Check since vm_start/vm_end might change before we lock the VMA */
-	if (unlikely(address < vma->vm_start || address >= vma->vm_end))
+	/* Check if the vma we locked is the right one. */
+	if (unlikely(vma->vm_mm != mm ||
+		     address < vma->vm_start || address >= vma->vm_end))
 		goto inval_end_read;
 
 	rcu_read_unlock();
diff --git a/mm/vma.c b/mm/vma.c
index cdc63728f47f..648784416833 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -414,7 +414,7 @@  void remove_vma(struct vm_area_struct *vma, bool unreachable)
 		fput(vma->vm_file);
 	mpol_put(vma_policy(vma));
 	if (unreachable)
-		__vm_area_free(vma);
+		vm_area_free_unreachable(vma);
 	else
 		vm_area_free(vma);
 }
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 0cdc5f8c3d60..3eeb1317cc69 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -685,14 +685,15 @@  static inline void mpol_put(struct mempolicy *)
 {
 }
 
-static inline void __vm_area_free(struct vm_area_struct *vma)
+static inline void vm_area_free(struct vm_area_struct *vma)
 {
 	free(vma);
 }
 
-static inline void vm_area_free(struct vm_area_struct *vma)
+static inline void vm_area_free_unreachable(struct vm_area_struct *vma)
 {
-	__vm_area_free(vma);
+	vma->detached = true;
+	vm_area_free(vma);
 }
 
 static inline void lru_add_drain(void)