From patchwork Fri Jul 21 18:23:50 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Jann Horn <jannh@google.com>
X-Patchwork-Id: 13322426
From: Jann Horn <jannh@google.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Suren Baghdasaryan <surenb@google.com>,
	linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	Will Deacon <will@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>
Subject: [PATCH] mm: Fix memory ordering for mm_lock_seq and vm_lock_seq
Date: Fri, 21 Jul 2023 20:23:50 +0200
Message-ID: <20230721182350.845615-1-jannh@google.com>
X-Mailer: git-send-email 2.41.0.487.g6d72f3e995-goog
MIME-Version: 1.0
List-ID: <linux-mm.kvack.org>

mm->mm_lock_seq effectively functions as a read/write lock; therefore it
must be used with acquire/release semantics.

A specific example is the interaction between userfaultfd_register() and
lock_vma_under_rcu().
userfaultfd_register() does the following from the point where it changes
a VMA's flags to the point where concurrent readers are permitted again
(in a simple scenario where only a single private VMA is accessed and no
merging/splitting is involved):

userfaultfd_register
  userfaultfd_set_vm_flags
    vm_flags_reset
      vma_start_write
        down_write(&vma->vm_lock->lock)
        vma->vm_lock_seq = mm_lock_seq [marks VMA as busy]
        up_write(&vma->vm_lock->lock)
      vm_flags_init
        [sets VM_UFFD_* in __vm_flags]
  vma->vm_userfaultfd_ctx.ctx = ctx
  mmap_write_unlock
    vma_end_write_all
      WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1) [unlocks VMA]

There are no memory barriers between the __vm_flags update and the
mm->mm_lock_seq update that unlocks the VMA, so the unlock can be reordered
above the vm_flags_init() call. From the perspective of a concurrent
reader, a VMA can then become marked as a userfaultfd VMA while it is not
VMA-locked. That's bad; we definitely need a store-release for the unlock
operation.
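
For illustration only, the writer side can be sketched in userspace C11,
assuming memory_order_release as a rough analog of smp_store_release().
All toy_* names and TOY_VM_UFFD are hypothetical stand-ins, not kernel
symbols, and the real rwsem handling is left out entirely:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical userspace stand-ins; these are not the kernel's structs. */
struct toy_mm {
	atomic_uint mm_lock_seq;	/* unsigned so wraparound is defined in ISO C */
};

struct toy_vma {
	struct toy_mm *mm;
	atomic_uint vm_lock_seq;
	unsigned long flags;		/* plain data guarded by the sequence "lock" */
};

#define TOY_VM_UFFD 0x1UL		/* made-up flag bit for this sketch */

/* Mark the VMA busy, like the vm_lock_seq store in vma_start_write(). */
static void toy_vma_start_write(struct toy_vma *vma)
{
	unsigned int seq = atomic_load_explicit(&vma->mm->mm_lock_seq,
						memory_order_relaxed);

	atomic_store_explicit(&vma->vm_lock_seq, seq, memory_order_relaxed);
}

/* Unlock all VMAs at once, like vma_end_write_all(). */
static void toy_vma_end_write_all(struct toy_mm *mm)
{
	unsigned int seq = atomic_load_explicit(&mm->mm_lock_seq,
						memory_order_relaxed);

	/*
	 * Release ordering: stores issued before this point (such as the
	 * flags update in toy_register()) may not be reordered after it.
	 */
	atomic_store_explicit(&mm->mm_lock_seq, seq + 1, memory_order_release);
}

/* Same order of operations as the call chain shown above. */
static void toy_register(struct toy_vma *vma)
{
	toy_vma_start_write(vma);
	/* The VMA must look write-locked before its flags are touched. */
	assert(atomic_load(&vma->vm_lock_seq) ==
	       atomic_load(&vma->mm->mm_lock_seq));
	vma->flags |= TOY_VM_UFFD;
	toy_vma_end_write_all(vma->mm);
}
```

With a relaxed store in toy_vma_end_write_all(), nothing in the C11 model
would order the flags update before the sequence bump, which mirrors the
problem described above.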

The non-atomic write to vma->vm_lock_seq in vma_start_write() is mostly
fine because all accesses to vma->vm_lock_seq that matter are always
protected by the VMA lock. However, there is a racy read in
vma_start_read() that can tolerate false positives, so we should use
WRITE_ONCE() to keep things tidy and data-race-free (including for KCSAN).
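
As a rough userspace approximation of what a marked access looks like,
the following sketch uses volatile casts in the spirit of the kernel's
READ_ONCE()/WRITE_ONCE(). The TOY_* macros and toy_* names are
hypothetical, rely on the GNU __typeof__ extension, and omit the type
checking the real macros perform:

```c
#include <assert.h>

/* Hypothetical approximations, not the kernel's actual macro definitions. */
#define TOY_READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))
#define TOY_WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

struct toy_seq_holder {
	int vm_lock_seq;
};

/* Writer: one volatile store, which the compiler may not split or repeat. */
static void toy_mark_busy(struct toy_seq_holder *h, int mm_lock_seq)
{
	TOY_WRITE_ONCE(h->vm_lock_seq, mm_lock_seq);
}

/* Lockless reader: one volatile load, suitable for a pessimistic check. */
static int toy_peek_seq(struct toy_seq_holder *h)
{
	return TOY_READ_ONCE(h->vm_lock_seq);
}
```

These accesses add no ordering; they only constrain what the compiler may
do with the individual load or store.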

On the other side, lock_vma_under_rcu() works as follows in the relevant
region for locking and userfaultfd check:

lock_vma_under_rcu
  vma_start_read
    vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq) [early bailout]
    down_read_trylock(&vma->vm_lock->lock)
    vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq) [main check]
  userfaultfd_armed
    checks vma->vm_flags & __VM_UFFD_FLAGS

Here, the interesting aspect is how far down the mm->mm_lock_seq read
can be reordered. If this read is reordered below the vma->vm_flags
access, lock_vma_under_rcu() could partly operate on information that was
read while the VMA was supposed to be locked.
To prevent the mm->mm_lock_seq read from sinking below later accesses,
we need to read it with a load-acquire.
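
The reader side can be sketched the same way in userspace C11, assuming
memory_order_acquire as a rough analog of smp_load_acquire(). The
sketch_* names and the readers counter (a stub for vm_lock->lock that
never fails) are hypothetical and only illustrate where the acquire load
sits:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; these are not the kernel's structs. */
struct sketch_mm {
	atomic_uint mm_lock_seq;
};

struct sketch_vma {
	struct sketch_mm *mm;
	atomic_uint vm_lock_seq;
	atomic_int readers;	/* stub for vm_lock->lock, readers only */
	unsigned long flags;
};

#define SKETCH_VM_UFFD 0x1UL	/* made-up flag bit for this sketch */

static bool sketch_vma_start_read(struct sketch_vma *vma)
{
	/* Early bailout: racy and allowed to be wrong, so relaxed loads. */
	if (atomic_load_explicit(&vma->vm_lock_seq, memory_order_relaxed) ==
	    atomic_load_explicit(&vma->mm->mm_lock_seq, memory_order_relaxed))
		return false;

	/* Stub for down_read_trylock(); no writer is modeled, so it succeeds. */
	atomic_fetch_add_explicit(&vma->readers, 1, memory_order_acquire);

	/*
	 * Main check: the acquire load keeps later reads of vma->flags from
	 * being reordered before the mm_lock_seq read.
	 */
	if (atomic_load_explicit(&vma->vm_lock_seq, memory_order_relaxed) ==
	    atomic_load_explicit(&vma->mm->mm_lock_seq, memory_order_acquire)) {
		atomic_fetch_sub_explicit(&vma->readers, 1, memory_order_release);
		return false;
	}
	return true;
}

static void sketch_vma_end_read(struct sketch_vma *vma)
{
	int prev = atomic_fetch_sub_explicit(&vma->readers, 1,
					     memory_order_release);

	assert(prev > 0);
	(void)prev;	/* avoid an unused warning when NDEBUG is defined */
}
```

A caller would test vma->flags & SKETCH_VM_UFFD only after
sketch_vma_start_read() returns true, and call sketch_vma_end_read() when
done.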

BACKPORT WARNING: One of the functions changed by this patch (which I've
written against Linus' tree) is vma_try_start_write(), but this function
no longer exists in mm/mm-everything. I don't know whether the merged
version of this patch will be ordered before or after the patch that
removes vma_try_start_write(). If you're backporting this patch to a
tree with vma_try_start_write(), make sure this patch changes that
function.

Fixes: 5e31275cc997 ("mm: add per-VMA lock and helper functions to control it")
Cc: stable@vger.kernel.org
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Jann Horn <jannh@google.com>
---
 include/linux/mm.h        | 21 ++++++++++++++++-----
 include/linux/mm_types.h  | 27 +++++++++++++++++++++++++++
 include/linux/mmap_lock.h | 10 ++++++++--
 3 files changed, 51 insertions(+), 7 deletions(-)


base-commit: d192f5382581d972c4ae1b4d72e0b59b34cadeb9

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2dd73e4f3d8e..beba0bfd43da 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -642,7 +642,7 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
 static inline bool vma_start_read(struct vm_area_struct *vma)
 {
 	/* Check before locking. A race might cause false locked result. */
-	if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
+	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq))
 		return false;
 
 	if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
@@ -653,8 +653,13 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	 * False unlocked result is impossible because we modify and check
 	 * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
 	 * modification invalidates all existing locks.
+	 *
+	 * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
+	 * racing with vma_end_write_all(), we only start reading from the VMA
+	 * after it has been unlocked.
+	 * This pairs with RELEASE semantics in vma_end_write_all().
 	 */
-	if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
+	if (unlikely(vma->vm_lock_seq == smp_load_acquire(&vma->vm_mm->mm_lock_seq))) {
 		up_read(&vma->vm_lock->lock);
 		return false;
 	}
@@ -676,7 +681,7 @@ static bool __is_vma_write_locked(struct vm_area_struct *vma, int *mm_lock_seq)
 	 * current task is holding mmap_write_lock, both vma->vm_lock_seq and
 	 * mm->mm_lock_seq can't be concurrently modified.
 	 */
-	*mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
+	*mm_lock_seq = vma->vm_mm->mm_lock_seq;
 	return (vma->vm_lock_seq == *mm_lock_seq);
 }
 
@@ -688,7 +693,13 @@ static inline void vma_start_write(struct vm_area_struct *vma)
 		return;
 
 	down_write(&vma->vm_lock->lock);
-	vma->vm_lock_seq = mm_lock_seq;
+	/*
+	 * We should use WRITE_ONCE() here because we can have concurrent reads
+	 * from the early lockless pessimistic check in vma_start_read().
+	 * We don't really care about the correctness of that early check, but
+	 * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
+	 */
+	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
 	up_write(&vma->vm_lock->lock);
 }
 
@@ -702,7 +713,7 @@ static inline bool vma_try_start_write(struct vm_area_struct *vma)
 	if (!down_write_trylock(&vma->vm_lock->lock))
 		return false;
 
-	vma->vm_lock_seq = mm_lock_seq;
+	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
 	up_write(&vma->vm_lock->lock);
 	return true;
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index de10fc797c8e..aa2444966f5f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -514,6 +514,19 @@ struct vm_area_struct {
 	};
 
 #ifdef CONFIG_PER_VMA_LOCK
+	/*
+	 * Can only be written (using WRITE_ONCE()) while holding both:
+	 *  - mmap_lock (in write mode)
+	 *  - vm_lock->lock (in write mode)
+	 * Can be read reliably while holding:
+	 *  - vm_lock->lock (in read or write mode)
+	 * Can be read unreliably (for pessimistic bailout) while holding
+	 * nothing (except RCU to keep the VMA struct allocated).
+	 *
+	 * This sequence counter is explicitly allowed to overflow; sequence
+	 * counter reuse can only lead to occasional unnecessary use of the
+	 * slowpath.
+	 */
 	int vm_lock_seq;
 	struct vma_lock *vm_lock;
 
@@ -679,6 +692,20 @@ struct mm_struct {
 					  * by mmlist_lock
 					  */
 #ifdef CONFIG_PER_VMA_LOCK
+		/*
+		 * This field has lock-like semantics; see also
+		 * vma->vm_lock_seq.
+		 * Incrementing the sequence number is equivalent to releasing
+		 * locks on VMAs and requires RELEASE semantics; reading the
+		 * sequence number is part of taking a read lock on a VMA and
+		 * requires ACQUIRE semantics.
+		 *
+		 * Can be written (with RELEASE semantics) while holding
+		 * mmap_lock in write mode.
+		 * Can be read (with ACQUIRE semantics) without holding any
+		 * locks on the MM (but you need to have a VMA locked to be able
+		 * to do anything useful with the result).
+		 */
 		int mm_lock_seq;
 #endif
 
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index aab8f1b28d26..e05e167dbd16 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -76,8 +76,14 @@ static inline void mmap_assert_write_locked(struct mm_struct *mm)
 static inline void vma_end_write_all(struct mm_struct *mm)
 {
 	mmap_assert_write_locked(mm);
-	/* No races during update due to exclusive mmap_lock being held */
-	WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
+	/*
+	 * Nobody can concurrently modify mm->mm_lock_seq due to exclusive
+	 * mmap_lock being held.
+	 * We need RELEASE semantics here to ensure that preceding stores into
+	 * the VMA take effect before we unlock it with this store.
+	 * Pairs with ACQUIRE semantics in vma_start_read().
+	 */
+	smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
 }
 #else
 static inline void vma_end_write_all(struct mm_struct *mm) {}