From patchwork Sat Oct 5 19:16:44 2024
From: Christian Brauner
Date: Sat, 05 Oct 2024 21:16:44 +0200
Subject: [PATCH RFC 1/4] fs: protect backing files with rcu
Message-Id: <20241005-brauner-file-rcuref-v1-1-725d5e713c86@kernel.org>
In-Reply-To: <20241005-brauner-file-rcuref-v1-0-725d5e713c86@kernel.org>
References: <20241005-brauner-file-rcuref-v1-0-725d5e713c86@kernel.org>
To: Linus Torvalds
Cc: linux-fsdevel@vger.kernel.org, Thomas Gleixner, Jann Horn, Christian Brauner

Currently backing files are not under any form of rcu protection.
Switching to rcuref requires rcu protection and so does the speculative
vma lookup. Switch them to the same rcu slab as regular files.
There should be no additional magic required as the lifetime of a
backing file is always tied to a regular file.

Signed-off-by: Christian Brauner
---
 fs/file_table.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index eed5ffad9997c24e533f88285deb537ddf9429ed..9fc9048145ca023ef8af8769d5f1234a69f10df1 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -40,6 +40,7 @@ static struct files_stat_struct files_stat = {
 
 /* SLAB cache for file structures */
 static struct kmem_cache *filp_cachep __ro_after_init;
+static struct kmem_cache *bfilp_cachep __ro_after_init;
 
 static struct percpu_counter nr_files __cacheline_aligned_in_smp;
 
@@ -68,7 +69,7 @@ static inline void file_free(struct file *f)
 	put_cred(f->f_cred);
 	if (unlikely(f->f_mode & FMODE_BACKING)) {
 		path_put(backing_file_user_path(f));
-		kfree(backing_file(f));
+		kmem_cache_free(bfilp_cachep, backing_file(f));
 	} else {
 		kmem_cache_free(filp_cachep, f);
 	}
@@ -267,13 +268,13 @@ struct file *alloc_empty_backing_file(int flags, const struct cred *cred)
 	struct backing_file *ff;
 	int error;
 
-	ff = kzalloc(sizeof(struct backing_file), GFP_KERNEL);
+	ff = kmem_cache_zalloc(bfilp_cachep, GFP_KERNEL);
 	if (unlikely(!ff))
 		return ERR_PTR(-ENOMEM);
 
 	error = init_file(&ff->file, flags, cred);
 	if (unlikely(error)) {
-		kfree(ff);
+		kmem_cache_free(bfilp_cachep, ff);
 		return ERR_PTR(error);
 	}
 
@@ -529,6 +530,10 @@ void __init files_init(void)
 	filp_cachep = kmem_cache_create("filp", sizeof(struct file), &args,
 			SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT |
 			SLAB_TYPESAFE_BY_RCU);
+
+	bfilp_cachep = kmem_cache_create("bfilp", sizeof(struct backing_file),
+			NULL, SLAB_HWCACHE_ALIGN | SLAB_PANIC |
+			SLAB_ACCOUNT | SLAB_TYPESAFE_BY_RCU);
 	percpu_counter_init(&nr_files, 0, GFP_KERNEL);
 }
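As an aside on what the SLAB_TYPESAFE_BY_RCU flag used above buys (and what
it doesn't): it keeps the memory type-stable across an RCU read-side critical
section but does not delay freeing until a grace period, so a slot can be
recycled immediately. A minimal sketch of the lookup pattern this forces on
callers, with hypothetical names (struct obj, obj_cachep, obj_ptr) that are
not part of this series:

struct obj {
	refcount_t ref;
};

static struct kmem_cache *obj_cachep;	/* created with SLAB_TYPESAFE_BY_RCU */
static struct obj __rcu *obj_ptr;

static void obj_put(struct obj *o)
{
	if (refcount_dec_and_test(&o->ref))
		kmem_cache_free(obj_cachep, o);	/* slot may be reused at once */
}

static struct obj *obj_get_rcu(void)
{
	struct obj *o;

	rcu_read_lock();
	o = rcu_dereference(obj_ptr);
	if (o && refcount_inc_not_zero(&o->ref)) {
		/* The slab may have recycled this memory; revalidate. */
		if (rcu_dereference(obj_ptr) != o) {
			obj_put(o);
			o = NULL;
		}
	} else {
		o = NULL;
	}
	rcu_read_unlock();
	return o;
}

Patch 4 below relies on exactly this kind of revalidation in
__fget_files_rcu().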
From patchwork Sat Oct 5 19:16:45 2024
From: Christian Brauner
Date: Sat, 05 Oct 2024 21:16:45 +0200
Subject: [PATCH RFC 2/4] types: add rcuref_long_t
Message-Id: <20241005-brauner-file-rcuref-v1-2-725d5e713c86@kernel.org>
In-Reply-To: <20241005-brauner-file-rcuref-v1-0-725d5e713c86@kernel.org>
References: <20241005-brauner-file-rcuref-v1-0-725d5e713c86@kernel.org>
To: Linus Torvalds
Cc: linux-fsdevel@vger.kernel.org, Thomas Gleixner, Jann Horn, Christian Brauner

Add a variant of rcuref that operates on atomic_long_t instead of
atomic_t so it can be used for data structures that require
atomic_long_t.

Signed-off-by: Christian Brauner
---
 include/linux/types.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/types.h b/include/linux/types.h
index 2bc8766ba20cab014a380f02e5644bd0d772ec67..b10bf351f3e4d1f1c1ca16248199470306de4aa0 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -190,6 +190,16 @@ typedef struct {
 
 #define RCUREF_INIT(i)	{ .refcnt = ATOMIC_INIT(i - 1) }
 
+typedef struct {
+#ifdef CONFIG_64BIT
+	atomic64_t refcnt;
+#else
+	atomic_t refcnt;
+#endif
+} rcuref_long_t;
+
+#define RCUREF_LONG_INIT(i)	{ .refcnt = ATOMIC_LONG_INIT(i - 1) }
+
 struct list_head {
 	struct list_head *next, *prev;
 };
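The "- 1" in RCUREF_LONG_INIT() is the usual rcuref bias: one held reference
is stored as 0 and "no references left" ends up as the all-ones pattern,
which is what lets the get/put fast paths added in the next patch use a
single atomic add plus a sign check. A tiny illustration (struct foo is a
made-up example, not part of the series):

struct foo {
	rcuref_long_t ref;
};

/* One initial reference; the stored counter value is RCUREF_LONG_ONEREF (0). */
static struct foo f = { .ref = RCUREF_LONG_INIT(1) };

With the helpers from the next patch, rcuref_long_read(&f.ref) would report 1
here.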
From patchwork Sat Oct 5 19:16:46 2024
From: Christian Brauner
Date: Sat, 05 Oct 2024 21:16:46 +0200
Subject: [PATCH RFC 3/4] rcuref: add rcuref_long_*() helpers
Message-Id: <20241005-brauner-file-rcuref-v1-3-725d5e713c86@kernel.org>
In-Reply-To: <20241005-brauner-file-rcuref-v1-0-725d5e713c86@kernel.org>
References: <20241005-brauner-file-rcuref-v1-0-725d5e713c86@kernel.org>
To: Linus Torvalds
Cc: linux-fsdevel@vger.kernel.org, Thomas Gleixner, Jann Horn, Christian Brauner

Add a variant of the rcuref helpers that operate on atomic_long_t
instead of atomic_t so rcuref can be used for data structures that
require atomic_long_t.
Signed-off-by: Christian Brauner
---
 include/linux/rcuref_long.h | 165 ++++++++++++++++++++++++++++++++++++++++++++
 lib/rcuref.c                | 104 ++++++++++++++++++++++++++++
 2 files changed, 269 insertions(+)

diff --git a/include/linux/rcuref_long.h b/include/linux/rcuref_long.h
new file mode 100644
index 0000000000000000000000000000000000000000..7cedc537e5268e114f1a4221a4f1b0cb8d0e1241
--- /dev/null
+++ b/include/linux/rcuref_long.h
@@ -0,0 +1,165 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _LINUX_RCUREF_LONG_H
+#define _LINUX_RCUREF_LONG_H
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#ifdef CONFIG_64BIT
+#define RCUREF_LONG_ONEREF	0x0000000000000000U
+#define RCUREF_LONG_MAXREF	0x7FFFFFFFFFFFFFFFU
+#define RCUREF_LONG_SATURATED	0xA000000000000000U
+#define RCUREF_LONG_RELEASED	0xC000000000000000U
+#define RCUREF_LONG_DEAD	0xE000000000000000U
+#define RCUREF_LONG_NOREF	0xFFFFFFFFFFFFFFFFU
+#else
+#define RCUREF_LONG_ONEREF	RCUREF_ONEREF
+#define RCUREF_LONG_MAXREF	RCUREF_MAXREF
+#define RCUREF_LONG_SATURATED	RCUREF_SATURATED
+#define RCUREF_LONG_RELEASED	RCUREF_RELEASED
+#define RCUREF_LONG_DEAD	RCUREF_DEAD
+#define RCUREF_LONG_NOREF	RCUREF_NOREF
+#endif
+
+/**
+ * rcuref_long_init - Initialize a rcuref reference count with the given reference count
+ * @ref: Pointer to the reference count
+ * @cnt: The initial reference count typically '1'
+ */
+static inline void rcuref_long_init(rcuref_long_t *ref, unsigned long cnt)
+{
+	atomic_long_set(&ref->refcnt, cnt - 1);
+}
+
+/**
+ * rcuref_long_read - Read the number of held reference counts of a rcuref
+ * @ref: Pointer to the reference count
+ *
+ * Return: The number of held references (0 ... N)
+ */
+static inline unsigned long rcuref_long_read(rcuref_long_t *ref)
+{
+	unsigned long c = atomic_long_read(&ref->refcnt);
+
+	/* Return 0 if within the DEAD zone. */
+	return c >= RCUREF_LONG_RELEASED ? 0 : c + 1;
+}
+
+__must_check bool rcuref_long_get_slowpath(rcuref_long_t *ref);
+
+/**
+ * rcuref_long_get - Acquire one reference on a rcuref reference count
+ * @ref: Pointer to the reference count
+ *
+ * Similar to atomic_long_inc_not_zero() but saturates at RCUREF_LONG_MAXREF.
+ *
+ * Provides no memory ordering, it is assumed the caller has guaranteed the
+ * object memory to be stable (RCU, etc.). It does provide a control dependency
+ * and thereby orders future stores. See documentation in lib/rcuref.c
+ *
+ * Return:
+ *	False if the attempt to acquire a reference failed. This happens
+ *	when the last reference has been put already
+ *
+ *	True if a reference was successfully acquired
+ */
+static inline __must_check bool rcuref_long_get(rcuref_long_t *ref)
+{
+	/*
+	 * Unconditionally increase the reference count. The saturation and
+	 * dead zones provide enough tolerance for this.
+	 */
+	if (likely(!atomic_long_add_negative_relaxed(1, &ref->refcnt)))
+		return true;
+
+	/* Handle the cases inside the saturation and dead zones */
+	return rcuref_long_get_slowpath(ref);
+}
+
+__must_check bool rcuref_long_put_slowpath(rcuref_long_t *ref);
+
+/*
+ * Internal helper. Do not invoke directly.
+ */
+static __always_inline __must_check bool __rcuref_long_put(rcuref_long_t *ref)
+{
+	RCU_LOCKDEP_WARN(!rcu_read_lock_held() && preemptible(),
+			 "suspicious rcuref_put_rcusafe() usage");
+	/*
+	 * Unconditionally decrease the reference count. The saturation and
+	 * dead zones provide enough tolerance for this.
+	 */
+	if (likely(!atomic_long_add_negative_release(-1, &ref->refcnt)))
+		return false;
+
+	/*
+	 * Handle the last reference drop and cases inside the saturation
+	 * and dead zones.
+	 */
+	return rcuref_long_put_slowpath(ref);
+}
+
+/**
+ * rcuref_long_put_rcusafe -- Release one reference for a rcuref reference count RCU safe
+ * @ref: Pointer to the reference count
+ *
+ * Provides release memory ordering, such that prior loads and stores are done
+ * before, and provides an acquire ordering on success such that free()
+ * must come after.
+ *
+ * Can be invoked from contexts, which guarantee that no grace period can
+ * happen which would free the object concurrently if the decrement drops
+ * the last reference and the slowpath races against a concurrent get() and
+ * put() pair. rcu_read_lock()'ed and atomic contexts qualify.
+ *
+ * Return:
+ *	True if this was the last reference with no future references
+ *	possible. This signals the caller that it can safely release the
+ *	object which is protected by the reference counter.
+ *
+ *	False if there are still active references or the put() raced
+ *	with a concurrent get()/put() pair. Caller is not allowed to
+ *	release the protected object.
+ */
+static inline __must_check bool rcuref_long_put_rcusafe(rcuref_long_t *ref)
+{
+	return __rcuref_long_put(ref);
+}
+
+/**
+ * rcuref_long_put -- Release one reference for a rcuref reference count
+ * @ref: Pointer to the reference count
+ *
+ * Can be invoked from any context.
+ *
+ * Provides release memory ordering, such that prior loads and stores are done
+ * before, and provides an acquire ordering on success such that free()
+ * must come after.
+ *
+ * Return:
+ *
+ *	True if this was the last reference with no future references
+ *	possible. This signals the caller that it can safely schedule the
+ *	object, which is protected by the reference counter, for
+ *	deconstruction.
+ *
+ *	False if there are still active references or the put() raced
+ *	with a concurrent get()/put() pair. Caller is not allowed to
+ *	deconstruct the protected object.
+ */
+static inline __must_check bool rcuref_long_put(rcuref_long_t *ref)
+{
+	bool released;
+
+	preempt_disable();
+	released = __rcuref_long_put(ref);
+	preempt_enable();
+	return released;
+}
+
+#endif
diff --git a/lib/rcuref.c b/lib/rcuref.c
index 97f300eca927ced7f36fe0c932d2a9d3759809b8..01a4c317c8bb7ff24632334ddb4520aa79aa46f3 100644
--- a/lib/rcuref.c
+++ b/lib/rcuref.c
@@ -176,6 +176,7 @@
 
 #include
 #include
+#include
 
 /**
  * rcuref_get_slowpath - Slowpath of rcuref_get()
@@ -217,6 +218,46 @@ bool rcuref_get_slowpath(rcuref_t *ref)
 }
 EXPORT_SYMBOL_GPL(rcuref_get_slowpath);
 
+/**
+ * rcuref_long_get_slowpath - Slowpath of rcuref_long_get()
+ * @ref: Pointer to the reference count
+ *
+ * Invoked when the reference count is outside of the valid zone.
+ *
+ * Return:
+ *	False if the reference count was already marked dead
+ *
+ *	True if the reference count is saturated, which prevents the
+ *	object from being deconstructed ever.
+ */
+bool rcuref_long_get_slowpath(rcuref_long_t *ref)
+{
+	unsigned long cnt = atomic_long_read(&ref->refcnt);
+
+	/*
+	 * If the reference count was already marked dead, undo the
+	 * increment so it stays in the middle of the dead zone and return
+	 * fail.
+	 */
+	if (cnt >= RCUREF_LONG_RELEASED) {
+		atomic_long_set(&ref->refcnt, RCUREF_LONG_DEAD);
+		return false;
+	}
+
+	/*
+	 * If it was saturated, warn and mark it so.
+	 * In case the increment was already on a saturated value restore
+	 * the saturation marker. This keeps it in the middle of the
+	 * saturation zone and prevents the reference count from overflowing.
+	 * This leaks the object memory, but prevents the obvious reference
+	 * count overflow damage.
+	 */
+	if (WARN_ONCE(cnt > RCUREF_LONG_MAXREF, "rcuref saturated - leaking memory"))
+		atomic_long_set(&ref->refcnt, RCUREF_LONG_SATURATED);
+	return true;
+}
+EXPORT_SYMBOL_GPL(rcuref_long_get_slowpath);
+
 /**
  * rcuref_put_slowpath - Slowpath of __rcuref_put()
  * @ref: Pointer to the reference count
@@ -279,3 +320,66 @@ bool rcuref_put_slowpath(rcuref_t *ref)
 	return false;
 }
 EXPORT_SYMBOL_GPL(rcuref_put_slowpath);
+
+/**
+ * rcuref_long_put_slowpath - Slowpath of __rcuref_long_put()
+ * @ref: Pointer to the reference count
+ *
+ * Invoked when the reference count is outside of the valid zone.
+ *
+ * Return:
+ *	True if this was the last reference with no future references
+ *	possible. This signals the caller that it can safely schedule the
+ *	object, which is protected by the reference counter, for
+ *	deconstruction.
+ *
+ *	False if there are still active references or the put() raced
+ *	with a concurrent get()/put() pair. Caller is not allowed to
+ *	deconstruct the protected object.
+ */
+bool rcuref_long_put_slowpath(rcuref_long_t *ref)
+{
+	unsigned long cnt = atomic_long_read(&ref->refcnt);
+
+	/* Did this drop the last reference? */
+	if (likely(cnt == RCUREF_LONG_NOREF)) {
+		/*
+		 * Carefully try to set the reference count to RCUREF_LONG_DEAD.
+		 *
+		 * This can fail if a concurrent get() operation has
+		 * elevated it again or the corresponding put() even marked
+		 * it dead already. Both are valid situations and do not
+		 * require a retry. If this fails the caller is not
+		 * allowed to deconstruct the object.
+		 */
+		if (!atomic_long_try_cmpxchg_release(&ref->refcnt, &cnt, RCUREF_LONG_DEAD))
+			return false;
+
+		/*
+		 * The caller can safely schedule the object for
+		 * deconstruction. Provide acquire ordering.
+		 */
+		smp_acquire__after_ctrl_dep();
+		return true;
+	}
+
+	/*
+	 * If the reference count was already in the dead zone, then this
+	 * put() operation is imbalanced. Warn, put the reference count back to
+	 * DEAD and tell the caller to not deconstruct the object.
+	 */
+	if (WARN_ONCE(cnt >= RCUREF_LONG_RELEASED, "rcuref - imbalanced put()")) {
+		atomic_long_set(&ref->refcnt, RCUREF_LONG_DEAD);
+		return false;
+	}
+
+	/*
+	 * This is a put() operation on a saturated refcount. Restore the
+	 * mean saturation value and tell the caller to not deconstruct the
+	 * object.
+	 */
+	if (cnt > RCUREF_LONG_MAXREF)
+		atomic_long_set(&ref->refcnt, RCUREF_LONG_SATURATED);
+	return false;
+}
+EXPORT_SYMBOL_GPL(rcuref_long_put_slowpath);
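To make the intended calling convention concrete, here is a small usage
sketch (not part of the patch); struct foo, foo_ptr, foo_lookup() and
foo_put() are hypothetical, and the release path uses kfree_rcu() as in the
deconstruction-race discussion quoted in the next patch:

struct foo {
	rcuref_long_t ref;
	struct rcu_head rcu;
};

static struct foo __rcu *foo_ptr;

/* Acquire: the object memory must be stable, e.g. inside rcu_read_lock(). */
static struct foo *foo_lookup(void)
{
	struct foo *f;

	rcu_read_lock();
	f = rcu_dereference(foo_ptr);
	if (f && !rcuref_long_get(&f->ref))
		f = NULL;	/* last reference was already dropped */
	rcu_read_unlock();
	return f;
}

/* Release: free only after a grace period so concurrent get()/put() stay safe. */
static void foo_put(struct foo *f)
{
	if (rcuref_long_put(&f->ref))
		kfree_rcu(f, rcu);
}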
From patchwork Sat Oct 5 19:16:47 2024
From: Christian Brauner
Date: Sat, 05 Oct 2024 21:16:47 +0200
Subject: [PATCH RFC 4/4] fs: port files to rcuref_long_t
Message-Id: <20241005-brauner-file-rcuref-v1-4-725d5e713c86@kernel.org>
In-Reply-To: <20241005-brauner-file-rcuref-v1-0-725d5e713c86@kernel.org>
References: <20241005-brauner-file-rcuref-v1-0-725d5e713c86@kernel.org>
To: Linus Torvalds
Cc: linux-fsdevel@vger.kernel.org, Thomas Gleixner, Jann Horn, Christian Brauner

As atomic_inc_not_zero() is implemented with a try_cmpxchg() loop it
has O(N^2) behaviour under contention with N concurrent operations.
The rcuref infrastructure uses atomic_add_negative_relaxed() for the
fast path, which scales better under contention and we get overflow
protection for free.

I've been testing this with will-it-scale using fstat() on a machine
that Jens gave me access to (thank you very much!):

processor	: 511
vendor_id	: AuthenticAMD
cpu family	: 25
model		: 160
model name	: AMD EPYC 9754 128-Core Processor

and I consistently get a 3-5% improvement on 256+ threads.

Files are SLAB_TYPESAFE_BY_RCU and thus don't have "regular" rcu
protection. In short, freeing of files isn't delayed until a grace
period has elapsed. Instead, they are freed immediately and thus can be
reused (multiple times) within the same grace period.

When picking a file from the file descriptor table via its file
descriptor number it is thus possible to see an elevated reference
count on file->f_count even though the file has already been recycled
possibly multiple times by another task.

To guard against this the vfs will pick the file from the file
descriptor table via its file descriptor number twice. Once before the
rcuref_long_get() and once after and compare the pointers (grossly
simplified). If they match then the file is still valid. If not the
caller needs to fput() it.

The rcuref infrastructure requires explicit rcu protection to handle
the following race:

> Deconstruction race
> ===================
>
> The release operation must be protected by prohibiting a grace period in
> order to prevent a possible use after free:
>
>	T1				T2
>	put()				get()
>	// ref->refcnt = ONEREF
>	if (!atomic_add_negative(-1, &ref->refcnt))
>		return false;				<- Not taken
>
>	// ref->refcnt == NOREF
>	--> preemption
>					// Elevates ref->refcnt to ONEREF
>					if (!atomic_add_negative(1, &ref->refcnt))
>						return true;	<- taken
>
>					if (put(&p->ref)) { <-- Succeeds
>						remove_pointer(p);
>						kfree_rcu(p, rcu);
>					}
>
>		RCU grace period ends, object is freed
>
>	atomic_cmpxchg(&ref->refcnt, NOREF, DEAD);	<- UAF
>
> [...] it prevents the grace period which keeps the object alive until
> all put() operations complete.

Having files be SLAB_TYPESAFE_BY_RCU shouldn't cause any problems for
the rcuref deconstruction race. Afaict, the only interesting case would
be someone freeing the file and someone immediately recycling it within
the same grace period and reinitializing file->f_count to ONEREF while
a concurrent fput() is doing atomic_cmpxchg(&ref->refcnt, NOREF, DEAD)
as in the race above. But this seems safe from SLAB_TYPESAFE_BY_RCU's
perspective and it should be safe from rcuref's perspective.

	T1			T2				T3
	fput()			fget()
	// f_count->refcnt = ONEREF
	if (!atomic_add_negative(-1, &f_count->refcnt))
		return false;			<- Not taken

	// f_count->refcnt == NOREF
	--> preemption
				// Elevates f_count->refcnt to ONEREF
				if (!atomic_add_negative(1, &f_count->refcnt))
					return true;	<- taken

				if (put(&f_count)) { <-- Succeeds
					remove_pointer(p);
					/*
					 * Cache is SLAB_TYPESAFE_BY_RCU
					 * so this is freed without a grace period.
					 */
					kmem_cache_free(p);
				}

							kmem_cache_alloc()
							init_file() {
								// Sets f_count->refcnt to ONEREF
								rcuref_long_init(&f->f_count, 1);
							}

	Object has been reused within the same grace period via
	kmem_cache_alloc()'s SLAB_TYPESAFE_BY_RCU.

	/*
	 * With SLAB_TYPESAFE_BY_RCU this would be a safe UAF access and
	 * it would probably work correctly because the atomic_cmpxchg()
	 * will fail because the refcount has been reset to ONEREF by T3.
	 */
	atomic_cmpxchg(&ref->refcnt, NOREF, DEAD); <- UAF

Signed-off-by: Christian Brauner
---
 drivers/gpu/drm/i915/gt/shmem_utils.c |  2 +-
 drivers/gpu/drm/vmwgfx/ttm_object.c   |  2 +-
 fs/eventpoll.c                        |  2 +-
 fs/file.c                             | 17 ++++++++---------
 fs/file_table.c                       |  7 ++++---
 include/linux/fs.h                    |  9 +++++----
 include/linux/rcuref_long.h           |  5 +++--
 7 files changed, 23 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/shmem_utils.c b/drivers/gpu/drm/i915/gt/shmem_utils.c
index 1fb6ff77fd899111a0797dac0edd3f2cfa01f42d..bb696b29ee2c992c6b6d0ec5ae538f9ebbb9ed29 100644
--- a/drivers/gpu/drm/i915/gt/shmem_utils.c
+++ b/drivers/gpu/drm/i915/gt/shmem_utils.c
@@ -40,7 +40,7 @@ struct file *shmem_create_from_object(struct drm_i915_gem_object *obj)
 
 	if (i915_gem_object_is_shmem(obj)) {
 		file = obj->base.filp;
-		atomic_long_inc(&file->f_count);
+		get_file(file);
 		return file;
 	}
 
diff --git a/drivers/gpu/drm/vmwgfx/ttm_object.c b/drivers/gpu/drm/vmwgfx/ttm_object.c
index 3353e97687d1d5d0e05bdc8f26ae4b0aae53a997..539dfec0e623ec2be730924fe7b8e28a2ff1face 100644
--- a/drivers/gpu/drm/vmwgfx/ttm_object.c
+++ b/drivers/gpu/drm/vmwgfx/ttm_object.c
@@ -471,7 +471,7 @@ void ttm_object_device_release(struct ttm_object_device **p_tdev)
  */
 static bool __must_check get_dma_buf_unless_doomed(struct dma_buf *dmabuf)
 {
-	return atomic_long_inc_not_zero(&dmabuf->file->f_count) != 0L;
+	return rcuref_long_get(&dmabuf->file->f_count);
 }
 
 /**
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 1ae4542f0bd88b07e323d0dd75be6c0fe9fff54f..0a033950225af274c21e503a6ea4813e5bab5dc2 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1002,7 +1002,7 @@ static struct file *epi_fget(const struct epitem *epi)
 	struct file *file;
 
 	file = epi->ffd.file;
-	if (!atomic_long_inc_not_zero(&file->f_count))
+	if (!rcuref_long_get(&file->f_count))
 		file = NULL;
 	return file;
 }
diff --git a/fs/file.c b/fs/file.c
index 5125607d040a2ff073e170d043124db5f444a90a..74e7a1cd709fc2147655d5e4b75cc2d8250bed88 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -866,10 +866,10 @@ static struct file *__get_file_rcu(struct file __rcu **f)
 	if (!file)
 		return NULL;
 
-	if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
+	if (unlikely(!rcuref_long_get(&file->f_count)))
 		return ERR_PTR(-EAGAIN);
 
-	file_reloaded = rcu_dereference_raw(*f);
+	file_reloaded = smp_load_acquire(f);
 
 	/*
 	 * Ensure that all accesses have a dependency on the load from
@@ -880,8 +880,8 @@ static struct file *__get_file_rcu(struct file __rcu **f)
 	OPTIMIZER_HIDE_VAR(file_reloaded_cmp);
 
 	/*
-	 * atomic_long_inc_not_zero() above provided a full memory
-	 * barrier when we acquired a reference.
+	 * smp_load_acquire() provided an acquire barrier when we loaded
+	 * the file pointer.
 	 *
 	 * This is paired with the write barrier from assigning to the
 	 * __rcu protected file pointer so that if that pointer still
@@ -979,11 +979,10 @@ static inline struct file *__fget_files_rcu(struct files_struct *files,
 		 * We need to confirm it by incrementing the refcount
 		 * and then check the lookup again.
 		 *
-		 * atomic_long_inc_not_zero() gives us a full memory
-		 * barrier. We only really need an 'acquire' one to
-		 * protect the loads below, but we don't have that.
+		 * rcuref_long_get() doesn't provide a memory barrier so
+		 * we use smp_load_acquire() on the file pointer below.
		 */
-		if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
+		if (unlikely(!rcuref_long_get(&file->f_count)))
 			continue;
 
 		/*
@@ -1000,7 +999,7 @@ static inline struct file *__fget_files_rcu(struct files_struct *files,
 		 *
 		 * If so, we need to put our ref and try again.
 		 */
-		if (unlikely(file != rcu_dereference_raw(*fdentry)) ||
+		if (unlikely(file != smp_load_acquire(fdentry)) ||
 		    unlikely(rcu_dereference_raw(files->fdt) != fdt)) {
 			fput(file);
 			continue;
diff --git a/fs/file_table.c b/fs/file_table.c
index 9fc9048145ca023ef8af8769d5f1234a69f10df1..f4b96a9dade804a81347865625418a0fdc9a7c09 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -28,6 +28,7 @@
 #include
 #include
 #include
+#include
 
 #include
 
@@ -175,7 +176,7 @@ static int init_file(struct file *f, int flags, const struct cred *cred)
 	 * fget-rcu pattern users need to be able to handle spurious
 	 * refcount bumps we should reinitialize the reused file first.
 	 */
-	atomic_long_set(&f->f_count, 1);
+	rcuref_long_init(&f->f_count, 1);
 	return 0;
 }
 
@@ -480,7 +481,7 @@ static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput);
 
 void fput(struct file *file)
 {
-	if (atomic_long_dec_and_test(&file->f_count)) {
+	if (rcuref_long_put_rcusafe(&file->f_count)) {
 		struct task_struct *task = current;
 
 		if (unlikely(!(file->f_mode & (FMODE_BACKING | FMODE_OPENED)))) {
@@ -513,7 +514,7 @@ void fput(struct file *file)
  */
 void __fput_sync(struct file *file)
 {
-	if (atomic_long_dec_and_test(&file->f_count))
+	if (rcuref_long_put_rcusafe(&file->f_count))
 		__fput(file);
 }
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e3c603d01337650d562405500013f5c4cfed8eb6..a7831eaf0edd13ebe9765e532602688b317da315 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -45,6 +45,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
 
@@ -1030,7 +1031,7 @@ static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
  * @f_freeptr: Pointer used by SLAB_TYPESAFE_BY_RCU file cache (don't touch.)
  */
 struct file {
-	atomic_long_t			f_count;
+	rcuref_long_t			f_count;
 	spinlock_t			f_lock;
 	fmode_t				f_mode;
 	const struct file_operations	*f_op;
@@ -1078,15 +1079,15 @@ struct file_handle {
 
 static inline struct file *get_file(struct file *f)
 {
-	long prior = atomic_long_fetch_inc_relaxed(&f->f_count);
-	WARN_ONCE(!prior, "struct file::f_count incremented from zero; use-after-free condition present!\n");
+	WARN_ONCE(!rcuref_long_get(&f->f_count),
+		  "struct file::f_count incremented from zero; use-after-free condition present!\n");
 	return f;
 }
 
 struct file *get_file_rcu(struct file __rcu **f);
 struct file *get_file_active(struct file **f);
 
-#define file_count(x)	atomic_long_read(&(x)->f_count)
+#define file_count(x)	rcuref_long_read(&(x)->f_count)
 
 #define MAX_NON_LFS	((1UL<<31) - 1)
 
diff --git a/include/linux/rcuref_long.h b/include/linux/rcuref_long.h
index 7cedc537e5268e114f1a4221a4f1b0cb8d0e1241..10623119bb5038a1b171e31b8fd962a87e3670f5 100644
--- a/include/linux/rcuref_long.h
+++ b/include/linux/rcuref_long.h
@@ -85,11 +85,12 @@ __must_check bool rcuref_long_put_slowpath(rcuref_long_t *ref);
 
 /*
  * Internal helper. Do not invoke directly.
+ *
+ * Ideally we'd RCU_LOCKDEP_WARN() here but we can't since this api is
+ * used with SLAB_TYPSAFE_BY_RCU.
  */
 static __always_inline __must_check bool __rcuref_long_put(rcuref_long_t *ref)
 {
-	RCU_LOCKDEP_WARN(!rcu_read_lock_held() && preemptible(),
-			 "suspicious rcuref_put_rcusafe() usage");
 	/*
 	 * Unconditionally decrease the reference count. The saturation and
 	 * dead zones provide enough tolerance for this.
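Finally, a grossly simplified editorial sketch of the "look the file up
twice" pattern the commit message above describes; it mirrors
__fget_files_rcu() in fs/file.c but elides the fd-table handling, so treat it
as an illustration rather than the real code:

/*
 * Simplified illustration of the revalidation __fget_files_rcu() performs
 * for SLAB_TYPESAFE_BY_RCU files (fd-table lookup details omitted).
 */
static struct file *fget_rcu_sketch(struct file __rcu **fdentry)
{
	struct file *file;

	rcu_read_lock();
	for (;;) {
		file = rcu_dereference_raw(*fdentry);
		if (!file)
			break;

		/* Speculative reference; the slab slot may have been recycled. */
		if (unlikely(!rcuref_long_get(&file->f_count)))
			continue;

		/*
		 * Re-read the table entry. If it still points to the same
		 * file, our reference pins the right object; otherwise drop
		 * the spurious reference and retry.
		 */
		if (unlikely(file != smp_load_acquire(fdentry))) {
			fput(file);
			continue;
		}
		break;
	}
	rcu_read_unlock();
	return file;
}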