From patchwork Fri Jul 21 22:51:07 2023
X-Patchwork-Submitter: Jann Horn
X-Patchwork-Id: 13322640
From: Jann Horn
To: Andrew Morton
Cc: Suren Baghdasaryan, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 Will Deacon, Peter Zijlstra
Subject: [PATCH v2] mm: Fix memory ordering for mm_lock_seq and vm_lock_seq
Date: Sat, 22 Jul 2023 00:51:07 +0200
Message-ID: <20230721225107.942336-1-jannh@google.com>

mm->mm_lock_seq effectively functions as a read/write lock; therefore it
must be used with acquire/release semantics.

A specific example is the interaction between userfaultfd_register() and
lock_vma_under_rcu().

userfaultfd_register() does the following from the point where it changes
a VMA's flags to the point where concurrent readers are permitted again
(in a simple scenario where only a single private VMA is accessed and no
merging/splitting is involved):

userfaultfd_register
  userfaultfd_set_vm_flags
    vm_flags_reset
      vma_start_write
        down_write(&vma->vm_lock->lock)
        vma->vm_lock_seq = mm_lock_seq [marks VMA as busy]
        up_write(&vma->vm_lock->lock)
      vm_flags_init
        [sets VM_UFFD_* in __vm_flags]
  vma->vm_userfaultfd_ctx.ctx = ctx
  mmap_write_unlock
    vma_end_write_all
      WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1) [unlocks VMA]

There are no memory barriers in between the __vm_flags update and the
mm->mm_lock_seq update that unlocks the VMA, so the unlock can be
reordered to above the `vm_flags_init()` call, which means that, from the
perspective of a concurrent reader, a VMA can be marked as a userfaultfd
VMA while it is not VMA-locked. That's bad; we definitely need a
store-release for the unlock operation.

The non-atomic write to vma->vm_lock_seq in vma_start_write() is mostly
fine because all accesses to vma->vm_lock_seq that matter are always
protected by the VMA lock. There is a racy read in vma_start_read()
though that can tolerate false positives, so we should be using
WRITE_ONCE() to keep things tidy and data-race-free (including for KCSAN).
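
As a rough userspace illustration of the writer side described above (not
kernel code), the following C11 sketch models the two sequence counters.
All names (`model_mm`, `model_vma`, `model_vma_start_write()`,
`model_vma_end_write_all()`, `model_vma_is_write_locked()`) are hypothetical.
The sketch assumes that `memory_order_release` can stand in for
smp_store_release() and a relaxed atomic store for WRITE_ONCE(); it leaves
out mmap_lock and the per-VMA rwsem, and uses unsigned counters so that
wraparound is well-defined in plain C:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins for mm_struct and vm_area_struct. */
struct model_mm {
	atomic_uint mm_lock_seq;
};

struct model_vma {
	struct model_mm *mm;
	atomic_uint vm_lock_seq;
	unsigned long flags;	/* plain data guarded by the VMA "lock" */
};

/* Mark the VMA busy by copying the current mm sequence number into it. */
static void model_vma_start_write(struct model_vma *vma)
{
	assert(vma && vma->mm);
	unsigned int seq = atomic_load_explicit(&vma->mm->mm_lock_seq,
						memory_order_relaxed);

	/* Relaxed atomic store: plays the role of WRITE_ONCE(). */
	atomic_store_explicit(&vma->vm_lock_seq, seq, memory_order_relaxed);
}

/* Unlock every VMA of this mm by bumping the mm sequence number. */
static void model_vma_end_write_all(struct model_mm *mm)
{
	assert(mm);
	unsigned int seq = atomic_load_explicit(&mm->mm_lock_seq,
						memory_order_relaxed);

	/*
	 * Release store: stores made earlier in program order (such as an
	 * update of vma->flags) cannot be reordered after this store.
	 */
	atomic_store_explicit(&mm->mm_lock_seq, seq + 1, memory_order_release);
}

/* A VMA counts as write-locked while both sequence numbers are equal. */
static bool model_vma_is_write_locked(struct model_vma *vma)
{
	assert(vma && vma->mm);
	return atomic_load_explicit(&vma->vm_lock_seq, memory_order_relaxed) ==
	       atomic_load_explicit(&vma->mm->mm_lock_seq, memory_order_acquire);
}
```

In this model, a flags update placed between model_vma_start_write() and
model_vma_end_write_all() is ordered before the unlock, which is the
property the plain WRITE_ONCE() in vma_end_write_all() does not provide.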

On the other side, lock_vma_under_rcu() works as follows in the relevant
region for locking and userfaultfd check:

lock_vma_under_rcu
  vma_start_read
    vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq) [early bailout]
    down_read_trylock(&vma->vm_lock->lock)
    vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq) [main check]
  userfaultfd_armed
    checks vma->vm_flags & __VM_UFFD_FLAGS

Here, the interesting aspect is how far down the mm->mm_lock_seq read can
be reordered - if this read is reordered down below the vma->vm_flags
access, this could cause lock_vma_under_rcu() to partly operate on
information that was read while the VMA was supposed to be locked.
To prevent this kind of downwards bleeding of the mm->mm_lock_seq read, we
need to read it with a load-acquire.

Some of the comment wording is based on suggestions by Suren.

BACKPORT WARNING: One of the functions changed by this patch (which I've
written against Linus' tree) is vma_try_start_write(), but this function
no longer exists in mm/mm-everything. I don't know whether the merged
version of this patch will be ordered before or after the patch that
removes vma_try_start_write(). If you're backporting this patch to a tree
with vma_try_start_write(), make sure this patch changes that function.
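
For illustration only, the reader-side ordering in lock_vma_under_rcu()
discussed above can be modelled in userspace C11 as follows. The names
(`model_mm`, `model_vma`, `model_vma_read_flags()`) are hypothetical, the
sketch assumes `memory_order_acquire` as a stand-in for smp_load_acquire()
and relaxed atomic loads for READ_ONCE(), and the down_read_trylock() step
is omitted, so this is a sketch of the ordering idea rather than of the
kernel implementation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins for mm_struct and vm_area_struct. */
struct model_mm {
	atomic_uint mm_lock_seq;
};

struct model_vma {
	struct model_mm *mm;
	atomic_uint vm_lock_seq;
	unsigned long flags;	/* plain data guarded by the VMA "lock" */
};

/*
 * Try to read vma->flags without taking the mm-wide lock.
 * Returns false if the VMA looks write-locked, true (with *flags_out set)
 * otherwise.
 */
static bool model_vma_read_flags(struct model_vma *vma, unsigned long *flags_out)
{
	assert(vma && vma->mm && flags_out);

	/* Early bailout: purely an optimisation, so relaxed loads suffice. */
	if (atomic_load_explicit(&vma->vm_lock_seq, memory_order_relaxed) ==
	    atomic_load_explicit(&vma->mm->mm_lock_seq, memory_order_relaxed))
		return false;

	/* (The kernel takes vma->vm_lock->lock for reading here; omitted.) */

	/*
	 * Main check: the acquire load keeps the flags read below from being
	 * reordered before the mm_lock_seq read. It pairs with a release
	 * store on the writer side.
	 */
	if (atomic_load_explicit(&vma->vm_lock_seq, memory_order_relaxed) ==
	    atomic_load_explicit(&vma->mm->mm_lock_seq, memory_order_acquire))
		return false;

	*flags_out = vma->flags;
	return true;
}
```

A caller would treat a false return the way lock_vma_under_rcu() treats a
failed vma_start_read(): give up on the lockless path and fall back to the
slow path.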

Fixes: 5e31275cc997 ("mm: add per-VMA lock and helper functions to control it")
Cc: stable@vger.kernel.org
Cc: Suren Baghdasaryan
Signed-off-by: Jann Horn
Reviewed-by: Suren Baghdasaryan
---

Notes:
    v2: made the comments much clearer based on off-list input from Suren

 include/linux/mm.h        | 29 +++++++++++++++++++++++------
 include/linux/mm_types.h  | 28 ++++++++++++++++++++++++++++
 include/linux/mmap_lock.h | 10 ++++++++--
 3 files changed, 59 insertions(+), 8 deletions(-)

base-commit: d192f5382581d972c4ae1b4d72e0b59b34cadeb9

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2dd73e4f3d8e..406ab9ea818f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -641,8 +641,14 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
  */
 static inline bool vma_start_read(struct vm_area_struct *vma)
 {
-	/* Check before locking. A race might cause false locked result. */
-	if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
+	/*
+	 * Check before locking. A race might cause false locked result.
+	 * We can use READ_ONCE() for the mm_lock_seq here, and don't need
+	 * ACQUIRE semantics, because this is just a lockless check whose result
+	 * we don't rely on for anything - the mm_lock_seq read against which we
+	 * need ordering is below.
+	 */
+	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq))
 		return false;
 
 	if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
@@ -653,8 +659,13 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 	 * False unlocked result is impossible because we modify and check
 	 * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
 	 * modification invalidates all existing locks.
+	 *
+	 * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
+	 * racing with vma_end_write_all(), we only start reading from the VMA
+	 * after it has been unlocked.
+	 * This pairs with RELEASE semantics in vma_end_write_all().
 	 */
-	if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
+	if (unlikely(vma->vm_lock_seq == smp_load_acquire(&vma->vm_mm->mm_lock_seq))) {
 		up_read(&vma->vm_lock->lock);
 		return false;
 	}
@@ -676,7 +687,7 @@ static bool __is_vma_write_locked(struct vm_area_struct *vma, int *mm_lock_seq)
 	 * current task is holding mmap_write_lock, both vma->vm_lock_seq and
 	 * mm->mm_lock_seq can't be concurrently modified.
 	 */
-	*mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
+	*mm_lock_seq = vma->vm_mm->mm_lock_seq;
 	return (vma->vm_lock_seq == *mm_lock_seq);
 }
 
@@ -688,7 +699,13 @@ static inline void vma_start_write(struct vm_area_struct *vma)
 		return;
 
 	down_write(&vma->vm_lock->lock);
-	vma->vm_lock_seq = mm_lock_seq;
+	/*
+	 * We should use WRITE_ONCE() here because we can have concurrent reads
+	 * from the early lockless pessimistic check in vma_start_read().
+	 * We don't really care about the correctness of that early check, but
+	 * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
+	 */
+	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
 	up_write(&vma->vm_lock->lock);
 }
 
@@ -702,7 +719,7 @@ static inline bool vma_try_start_write(struct vm_area_struct *vma)
 	if (!down_write_trylock(&vma->vm_lock->lock))
 		return false;
 
-	vma->vm_lock_seq = mm_lock_seq;
+	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
 	up_write(&vma->vm_lock->lock);
 	return true;
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index de10fc797c8e..5e74ce4a28cd 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -514,6 +514,20 @@ struct vm_area_struct {
 	};
 
 #ifdef CONFIG_PER_VMA_LOCK
+	/*
+	 * Can only be written (using WRITE_ONCE()) while holding both:
+	 *  - mmap_lock (in write mode)
+	 *  - vm_lock->lock (in write mode)
+	 * Can be read reliably while holding one of:
+	 *  - mmap_lock (in read or write mode)
+	 *  - vm_lock->lock (in read or write mode)
+	 * Can be read unreliably (using READ_ONCE()) for pessimistic bailout
+	 * while holding nothing (except RCU to keep the VMA struct allocated).
+	 *
+	 * This sequence counter is explicitly allowed to overflow; sequence
+	 * counter reuse can only lead to occasional unnecessary use of the
+	 * slowpath.
+	 */
 	int vm_lock_seq;
 	struct vma_lock *vm_lock;
 
@@ -679,6 +693,20 @@ struct mm_struct {
 					  * by mmlist_lock
 					  */
 #ifdef CONFIG_PER_VMA_LOCK
+		/*
+		 * This field has lock-like semantics, meaning it is sometimes
+		 * accessed with ACQUIRE/RELEASE semantics.
+		 * Roughly speaking, incrementing the sequence number is
+		 * equivalent to releasing locks on VMAs; reading the sequence
+		 * number can be part of taking a read lock on a VMA.
+		 *
+		 * Can be modified under write mmap_lock using RELEASE
+		 * semantics.
+		 * Can be read with no other protection when holding write
+		 * mmap_lock.
+		 * Can be read with ACQUIRE semantics if not holding write
+		 * mmap_lock.
+		 */
 		int mm_lock_seq;
 #endif
 
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index aab8f1b28d26..e05e167dbd16 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -76,8 +76,14 @@ static inline void mmap_assert_write_locked(struct mm_struct *mm)
 static inline void vma_end_write_all(struct mm_struct *mm)
 {
 	mmap_assert_write_locked(mm);
-	/* No races during update due to exclusive mmap_lock being held */
-	WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
+	/*
+	 * Nobody can concurrently modify mm->mm_lock_seq due to exclusive
+	 * mmap_lock being held.
+	 * We need RELEASE semantics here to ensure that preceding stores into
+	 * the VMA take effect before we unlock it with this store.
+	 * Pairs with ACQUIRE semantics in vma_start_read().
+	 */
+	smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
 }
 #else
 static inline void vma_end_write_all(struct mm_struct *mm) {}