Message ID | 20241007-brauner-file-rcuref-v2-3-387e24dc9163@kernel.org (mailing list archive)
---|---
State | New
Series | fs: introduce file_ref_t
On Mon, Oct 7, 2024 at 4:23 PM Christian Brauner <brauner@kernel.org> wrote:
> Port files to rely on file_ref reference to improve scaling and gain
> overflow protection.
[...]
> diff --git a/fs/file_table.c b/fs/file_table.c
> index 4b23eb7b79dd9d4ec779f4c01ba2e902988895dc..3f5dc4176b21ff82cc9440ed92a0ad962fdb2046 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -178,7 +178,7 @@ static int init_file(struct file *f, int flags, const struct cred *cred)
>  	 * fget-rcu pattern users need to be able to handle spurious
>  	 * refcount bumps we should reinitialize the reused file first.
>  	 */
> -	atomic_long_set(&f->f_count, 1);
> +	file_ref_init(&f->f_ref, 1);

It is good that you use file_ref_init() here to atomically initialize
the file_ref; however, I think it is problematic that before this,
alloc_empty_file() uses kmem_cache_zalloc(filp_cachep, GFP_KERNEL) to
allocate the file, because that sets __GFP_ZERO, which means that
slab_post_alloc_hook() will use memset() to zero the file object. That
causes trouble in two different ways:

1. After the memset() has changed the file ref to zero, I think
file_ref_get() can return true? Which means __get_file_rcu() could
believe that it acquired a reference, and we could race like this:

task A                              task B
__get_file_rcu()
  rcu_dereference_raw()
                                    close()
                                      [frees file]
                                    alloc_empty_file()
                                      kmem_cache_zalloc()
                                        [reallocates same file]
                                        memset(..., 0, ...)
  file_ref_get()
    [increments 0->1, returns true]
                                    init_file()
                                      file_ref_init(..., 1)
                                        [sets to 0]
  rcu_dereference_raw()
  fput()
    file_ref_put()
      [decrements 0->FILE_REF_NOREF, frees file]
                                    [UAF]

2. AFAIK the memset() is not guaranteed to atomically update an
"unsigned long", so you could see an entirely bogus torn counter
value.

The only reason this worked in the old code is that the refcount value
stored in freed files is 0.

So I think you need to stop using kmem_cache_zalloc() to allocate
files, and instead use a constructor function that zeroes the refcount
field, and manually memset() the rest of the "struct file" to 0 after
calling kmem_cache_alloc().

>  	return 0;
>  }
>
On Fri, Oct 25, 2024 at 10:45 PM Jann Horn <jannh@google.com> wrote:
> On Mon, Oct 7, 2024 at 4:23 PM Christian Brauner <brauner@kernel.org> wrote:
> > Port files to rely on file_ref reference to improve scaling and gain
> > overflow protection.
[...]
> So I think you need to stop using kmem_cache_zalloc() to allocate
> files, and instead use a constructor function that zeroes the refcount
> field, and manually memset() the rest of the "struct file" to 0 after
> calling kmem_cache_alloc().

Actually, thinking about this again, I guess there's actually no
reason why you'd need a constructor function; if you just avoid
memset()ing the refcount field on allocation, that'd be good enough.
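To make the failure mode concrete, here is a minimal userspace model of a
biased counter in the spirit of file_ref_t, built only from the semantics
visible in the diagram above (file_ref_init(..., 1) stores 0, and a put
that drops below zero is treated as the final reference). The model_ref_*
names are invented for illustration and are not the kernel API; the real
counter is an atomic_long_t defined in the file_ref patch earlier in this
series.

/*
 * Userspace model only, not the kernel implementation. The one property
 * shared with file_ref_t (per the race diagram above) is that the value 0
 * encodes "one reference held", so a freshly memset()-zeroed object is
 * indistinguishable from one with a live reference.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct model_ref { long cnt; };

static void model_ref_init(struct model_ref *r, long refs)
{
        r->cnt = refs - 1;              /* bias: one reference is stored as 0 */
}

static bool model_ref_get(struct model_ref *r)
{
        return ++r->cnt >= 0;           /* false only in the "dead" range below 0 */
}

static bool model_ref_put(struct model_ref *r)
{
        return --r->cnt < 0;            /* true: last reference gone, free the object */
}

int main(void)
{
        struct model_ref ref;

        /* task B: file freed, reallocated, and zeroed by kmem_cache_zalloc() */
        memset(&ref, 0, sizeof(ref));

        /* task A: the get on the zeroed counter spuriously succeeds */
        printf("get on zeroed ref: %d\n", model_ref_get(&ref));

        /* task B: init_file() later resets the counter to one reference */
        model_ref_init(&ref, 1);

        /* task A: dropping the bogus reference now "frees" B's live file */
        printf("put reports final ref: %d\n", model_ref_put(&ref));
        return 0;
}

Both prints report 1, which is exactly the sequence that ends in the
use-after-free above.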
> Actually, thinking about this again, I guess there's actually no
> reason why you'd need a constructor function; if you just avoid
> memset()ing the refcount field on allocation, that'd be good enough.

Thanks for catching this. So what I did is:

diff --git a/fs/file_table.c b/fs/file_table.c
index 5b8b18259eca..c81a47aea47f 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -148,10 +148,14 @@ static int __init init_fs_stat_sysctls(void)
 fs_initcall(init_fs_stat_sysctls);
 #endif
 
+static_assert(offsetof(struct file, f_ref) == 0);
+
 static int init_file(struct file *f, int flags, const struct cred *cred)
 {
 	int error;
 
+	memset((void *)f + sizeof(file_ref_t), 0, sizeof(*f) - sizeof(file_ref_t));
+
 	f->f_cred = get_cred(cred);
 	error = security_file_alloc(f);
 	if (unlikely(error)) {
@@ -209,7 +213,7 @@ struct file *alloc_empty_file(int flags, const struct cred *cred)
 		goto over;
 	}
 
-	f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL);
+	f = kmem_cache_alloc(filp_cachep, GFP_KERNEL);
 	if (unlikely(!f))
 		return ERR_PTR(-ENOMEM);
 
@@ -243,7 +247,7 @@ struct file *alloc_empty_file_noaccount(int flags, const struct cred *cred)
 	struct file *f;
 	int error;
 
-	f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL);
+	f = kmem_cache_alloc(filp_cachep, GFP_KERNEL);
 	if (unlikely(!f))
 		return ERR_PTR(-ENOMEM);
 
@@ -270,7 +274,7 @@ struct file *alloc_empty_backing_file(int flags, const struct cred *cred)
 	struct backing_file *ff;
 	int error;
 
-	ff = kmem_cache_zalloc(bfilp_cachep, GFP_KERNEL);
+	ff = kmem_cache_alloc(bfilp_cachep, GFP_KERNEL);
 	if (unlikely(!ff))
 		return ERR_PTR(-ENOMEM);
On Mon, 28 Oct 2024 at 01:17, Christian Brauner <brauner@kernel.org> wrote:
>
> Thanks for catching this. So what I did is:

You had better remove the __randomize_layout from 'struct file' too,
otherwise your patch is entirely broken.

We should damn well remove it anyway, the whole struct randomization
is just a bad joke. Nobody sane enables it, afaik. But for your patch
in particular, it's now an active bug.

Also, I wonder if we would be better off with f_count _away_ from the
other fields we touch, because the file count ref always ends up
making it cpu-local, so no shared caching behavior. We had that
reported for the inode contents.

So any threaded use of the same file will end up bouncing not just the
refcount, but also won't be caching some of the useful info at the
beginning of the file, that is basically read-only and could be shared
across CPUs.

              Linus
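As an aside, the offset-0 assumption in the proposed memset() could also be
avoided entirely by zeroing around the refcount field with offsetof(), which
keeps working wherever the field ends up in the layout. A minimal sketch
with a stand-in struct, not the real struct file and not what was merged:

#include <stddef.h>
#include <string.h>

typedef struct { long refcnt; } file_ref_t;     /* stand-in */

struct file_stub {                              /* stand-in for struct file */
        void *f_cred;
        file_ref_t f_ref;                       /* may sit at any offset */
        unsigned int f_mode;
};

/* Zero every member except f_ref, independent of where the field lives. */
void zero_file_except_ref(struct file_stub *f)
{
        const size_t off = offsetof(struct file_stub, f_ref);
        const size_t end = off + sizeof(f->f_ref);

        memset(f, 0, off);                              /* members before f_ref */
        memset((char *)f + end, 0, sizeof(*f) - end);   /* members after f_ref */
}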
On Mon, Oct 28, 2024 at 08:30:39AM -1000, Linus Torvalds wrote:
> On Mon, 28 Oct 2024 at 01:17, Christian Brauner <brauner@kernel.org> wrote:
> >
> > Thanks for catching this. So what I did is:
>
> You had better remove the __randomize_layout from 'struct file' too,
> otherwise your patch is entirely broken.

Yeah, it has actively been baffling me why this ever made it to struct
file in the first place.

> We should damn well remove it anyway, the whole struct randomization
> is just a bad joke. Nobody sane enables it, afaik. But for your patch
> in particular, it's now an active bug.

I'm aware and so I did end up just initializing things manually instead
of the proposed patch. We do always call init_file() and we do
initialize a bunch of fields in there already anyway. So might just as
well do the rest of the fields.

> Also, I wonder if we would be better off with f_count _away_ from the
> other fields we touch, because the file count ref always ends up
> making it cpu-local, so no shared caching behavior. We had that
> reported for the inode contents.
>
> So any threaded use of the same file will end up bouncing not just the
> refcount, but also won't be caching some of the useful info at the
> beginning of the file, that is basically read-only and could be shared
> across CPUs.

Yes, I'm aware of that and I commented on that in the commit message
that we will likely end up reordering fields to prevent such false
sharing.
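For reference, a rough sketch of that direction; the exact member list in
the merged init_file() may differ, and the fields shown here are only an
illustrative subset of struct file:

static int init_file(struct file *f, int flags, const struct cred *cred)
{
        int error;

        f->f_cred = get_cred(cred);
        error = security_file_alloc(f);
        if (unlikely(error)) {
                put_cred(f->f_cred);
                return error;
        }

        /* members that used to rely on __GFP_ZERO are now written explicitly */
        spin_lock_init(&f->f_lock);
        mutex_init(&f->f_pos_lock);
        f->f_flags = flags;
        f->f_mode = OPEN_FMODE(flags) | FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
        /* ... remaining members ... */

        /*
         * Publish the reference count last. Until this point a speculative
         * file_ref_get() from the fget-rcu path sees whatever dead/dying
         * value the previous owner's final file_ref_put() left behind,
         * which that pattern is prepared to handle, rather than a
         * memset()-cleared zero that looks like a live reference.
         */
        file_ref_init(&f->f_ref, 1);
        return 0;
}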
On Mon, 28 Oct 2024 at 19:33, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Mon, 28 Oct 2024 at 01:17, Christian Brauner <brauner@kernel.org> wrote:
> >
> > Thanks for catching this. So what I did is:
>
> You had better remove the __randomize_layout from 'struct file' too,
> otherwise your patch is entirely broken.
>
> We should damn well remove it anyway, the whole struct randomization
> is just a bad joke. Nobody sane enables it, afaik.

The French Cybersecurity Agency (ANSSI) and the American NSA actively
endorse it and want to put that crap on the list of mandatory things
for projects. Maybe you can find some curses for them, please?
Linus-style and loud, please!

Ced
diff --git a/drivers/gpu/drm/i915/gt/shmem_utils.c b/drivers/gpu/drm/i915/gt/shmem_utils.c
index 1fb6ff77fd899111a0797dac0edd3f2cfa01f42d..bb696b29ee2c992c6b6d0ec5ae538f9ebbb9ed29 100644
--- a/drivers/gpu/drm/i915/gt/shmem_utils.c
+++ b/drivers/gpu/drm/i915/gt/shmem_utils.c
@@ -40,7 +40,7 @@ struct file *shmem_create_from_object(struct drm_i915_gem_object *obj)
 
 	if (i915_gem_object_is_shmem(obj)) {
 		file = obj->base.filp;
-		atomic_long_inc(&file->f_count);
+		get_file(file);
 		return file;
 	}
 
diff --git a/drivers/gpu/drm/vmwgfx/ttm_object.c b/drivers/gpu/drm/vmwgfx/ttm_object.c
index 3353e97687d1d5d0e05bdc8f26ae4b0aae53a997..a17e62867f3b33cd1aafade244d387b43bb66b51 100644
--- a/drivers/gpu/drm/vmwgfx/ttm_object.c
+++ b/drivers/gpu/drm/vmwgfx/ttm_object.c
@@ -471,7 +471,7 @@ void ttm_object_device_release(struct ttm_object_device **p_tdev)
  */
 static bool __must_check get_dma_buf_unless_doomed(struct dma_buf *dmabuf)
 {
-	return atomic_long_inc_not_zero(&dmabuf->file->f_count) != 0L;
+	return file_ref_get(&dmabuf->file->f_ref);
 }
 
 /**
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 1ae4542f0bd88b07e323d0dd75be6c0fe9fff54f..212383cefe6c9fe13a38061c2c81e5b6ff857925 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1002,7 +1002,7 @@ static struct file *epi_fget(const struct epitem *epi)
 	struct file *file;
 
 	file = epi->ffd.file;
-	if (!atomic_long_inc_not_zero(&file->f_count))
+	if (!file_ref_get(&file->f_ref))
 		file = NULL;
 	return file;
 }
diff --git a/fs/file.c b/fs/file.c
index 1b5fc867d8ddff856501ba49d8c751f888810330..f4df09e92b6158ee40a794a4e0462f57acf595cf 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -972,7 +972,7 @@ static struct file *__get_file_rcu(struct file __rcu **f)
 	if (!file)
 		return NULL;
 
-	if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
+	if (unlikely(!file_ref_get(&file->f_ref)))
 		return ERR_PTR(-EAGAIN);
 
 	file_reloaded = rcu_dereference_raw(*f);
@@ -986,8 +986,8 @@ static struct file *__get_file_rcu(struct file __rcu **f)
 	OPTIMIZER_HIDE_VAR(file_reloaded_cmp);
 
 	/*
-	 * atomic_long_inc_not_zero() above provided a full memory
-	 * barrier when we acquired a reference.
+	 * file_ref_get() above provided a full memory barrier when we
+	 * acquired a reference.
 	 *
 	 * This is paired with the write barrier from assigning to the
 	 * __rcu protected file pointer so that if that pointer still
@@ -1085,11 +1085,11 @@ static inline struct file *__fget_files_rcu(struct files_struct *files,
 		 * We need to confirm it by incrementing the refcount
 		 * and then check the lookup again.
 		 *
-		 * atomic_long_inc_not_zero() gives us a full memory
-		 * barrier. We only really need an 'acquire' one to
-		 * protect the loads below, but we don't have that.
+		 * file_ref_get() gives us a full memory barrier. We
+		 * only really need an 'acquire' one to protect the
+		 * loads below, but we don't have that.
 		 */
-		if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
+		if (unlikely(!file_ref_get(&file->f_ref)))
 			continue;
 
 		/*
diff --git a/fs/file_table.c b/fs/file_table.c
index 4b23eb7b79dd9d4ec779f4c01ba2e902988895dc..3f5dc4176b21ff82cc9440ed92a0ad962fdb2046 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -178,7 +178,7 @@ static int init_file(struct file *f, int flags, const struct cred *cred)
 	 * fget-rcu pattern users need to be able to handle spurious
 	 * refcount bumps we should reinitialize the reused file first.
 	 */
-	atomic_long_set(&f->f_count, 1);
+	file_ref_init(&f->f_ref, 1);
 	return 0;
 }
 
@@ -483,7 +483,7 @@ static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput);
 
 void fput(struct file *file)
 {
-	if (atomic_long_dec_and_test(&file->f_count)) {
+	if (file_ref_put(&file->f_ref)) {
 		struct task_struct *task = current;
 
 		if (unlikely(!(file->f_mode & (FMODE_BACKING | FMODE_OPENED)))) {
@@ -516,7 +516,7 @@ void fput(struct file *file)
  */
 void __fput_sync(struct file *file)
 {
-	if (atomic_long_dec_and_test(&file->f_count))
+	if (file_ref_put(&file->f_ref))
 		__fput(file);
 }
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e3c603d01337650d562405500013f5c4cfed8eb6..fece097d41a8fb47a1483e5418f1e7319405bba2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -45,6 +45,7 @@
 #include <linux/slab.h>
 #include <linux/maple_tree.h>
 #include <linux/rw_hint.h>
+#include <linux/file_ref.h>
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
 
@@ -1030,7 +1031,7 @@ static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
  * @f_freeptr: Pointer used by SLAB_TYPESAFE_BY_RCU file cache (don't touch.)
  */
 struct file {
-	atomic_long_t			f_count;
+	file_ref_t			f_ref;
 	spinlock_t			f_lock;
 	fmode_t				f_mode;
 	const struct file_operations	*f_op;
@@ -1078,15 +1079,15 @@ struct file_handle {
 
 static inline struct file *get_file(struct file *f)
 {
-	long prior = atomic_long_fetch_inc_relaxed(&f->f_count);
-	WARN_ONCE(!prior, "struct file::f_count incremented from zero; use-after-free condition present!\n");
+	WARN_ONCE(!file_ref_get(&f->f_ref),
+		  "struct file::f_ref incremented from zero; use-after-free condition present!\n");
 	return f;
 }
 
 struct file *get_file_rcu(struct file __rcu **f);
 struct file *get_file_active(struct file **f);
 
-#define file_count(x)	atomic_long_read(&(x)->f_count)
+#define file_count(f)	file_ref_read(&(f)->f_ref)
 
 #define MAX_NON_LFS	((1UL<<31) - 1)
Port files to rely on file_ref reference to improve scaling and gain
overflow protection.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 drivers/gpu/drm/i915/gt/shmem_utils.c |  2 +-
 drivers/gpu/drm/vmwgfx/ttm_object.c   |  2 +-
 fs/eventpoll.c                        |  2 +-
 fs/file.c                             | 14 +++++++-------
 fs/file_table.c                       |  6 +++---
 include/linux/fs.h                    |  9 +++++----
 6 files changed, 18 insertions(+), 17 deletions(-)
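The "overflow protection" mentioned in the commit message comes from the
counter scheme saturating instead of wrapping back into the zero/freed
range. A minimal userspace model of that idea, again with invented
model_ref_* names and toy constants rather than the kernel's actual
file_ref implementation:

#include <stdbool.h>
#include <stdio.h>

#define MODEL_MAXREF    100L    /* toy threshold so the demo saturates quickly */
#define MODEL_SATURATED 1000L   /* parking value well outside the normal range */

struct model_ref { long cnt; }; /* 0 encodes "one reference held" */

static bool model_ref_get(struct model_ref *r)
{
        if (r->cnt >= MODEL_MAXREF) {
                /* saturate: intentionally leak instead of risking a wrap to 0 */
                r->cnt = MODEL_SATURATED;
                return true;
        }
        return ++r->cnt >= 0;
}

static bool model_ref_put(struct model_ref *r)
{
        if (r->cnt >= MODEL_SATURATED)
                return false;   /* a saturated object is never freed */
        return --r->cnt < 0;    /* true: final reference dropped */
}

int main(void)
{
        struct model_ref r = { .cnt = 0 };      /* one reference */

        for (int i = 0; i < 500; i++)
                model_ref_get(&r);
        printf("after 500 gets: cnt=%ld\n", r.cnt);
        printf("final put frees: %d\n", model_ref_put(&r));    /* 0: never freed */
        return 0;
}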