diff mbox series

[1/2] fs/kernel_read_file: add support for duplicate detection

Message ID 20230524213620.3509138-2-mcgrof@kernel.org (mailing list archive)
State New, archived
Headers show
Series module: avoid all memory pressure due to duplicates | expand

Commit Message

Luis Chamberlain May 24, 2023, 9:36 p.m. UTC
Add support for a new call which allows to detect duplicate requests
for each inode passed. This enables users to avoid having to incur
a whole vmalloc() for duplicates inodes with kernel_read_file().
Support to avoid duplicates is desirable in-kernel since changing
existing userspace or kernel users to account for these duplicates
would otherwise be difficult to implement, and the measured impact
of amount of wasted memory due to duplicates with vmalloc is observed
to be high.

This currently goes disabled because no user exists yet which wants
this enabled. Kernel code which needs this enabled should select the
new CONFIG_KREAD_UNIQ, otherwise the API falls back to the existing
kernel_read_file_from_fd().

If we later need to have some code enable CONFIG_KREAD_UNIQ but some
not we can have the feature be enabled per enum kernel_read_file_id id.
For now this should cover the main future use case with modules and
allow easily to disable / enable this feature with just one future
kconfig option.

Contrary to kernel_read_file_from_fd() users of thew new API will
use kread_uniq_fd(), keep track of the inode internally, and once
done use kread_uniq_fd_free() once the inode is no longer used.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 fs/Kconfig                       |   3 +
 fs/kernel_read_file.c            | 124 +++++++++++++++++++++++++++++++
 include/linux/kernel_read_file.h |  14 ++++
 3 files changed, 141 insertions(+)

Comments

Linus Torvalds May 24, 2023, 9:52 p.m. UTC | #1
On Wed, May 24, 2023 at 2:36 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> Add support for a new call which allows to detect duplicate requests
> for each inode passed.

No.

This is all disgusting.

Stop adding horrific code for some made-up load that isn't real.

           Linus
Linus Torvalds May 24, 2023, 9:56 p.m. UTC | #2
On Wed, May 24, 2023 at 2:52 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Stop adding horrific code for some made-up load that isn't real.

Even if you trigger some "worst case 3x memory use", that is
_temporary_, and will be free'd in the end.

The patches to "fix" this are worse than the disease.

                  Linus
Luis Chamberlain May 24, 2023, 10:07 p.m. UTC | #3
On Wed, May 24, 2023 at 02:56:47PM -0700, Linus Torvalds wrote:
> On Wed, May 24, 2023 at 2:52 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Stop adding horrific code for some made-up load that isn't real.
> 
> Even if you trigger some "worst case 3x memory use", that is
> _temporary_, and will be free'd in the end.
> 
> The patches to "fix" this are worse than the disease.

Fine with me, we can punt back and wait and hope udev fixes this.
No one can tell me I didn't try. Now let's hope userspace will try
an alternative.

  Luis
Linus Torvalds May 25, 2023, 4 a.m. UTC | #4
On Wed, May 24, 2023 at 2:52 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> This is all disgusting.

Bringing back the original thread, because I just sent an alternate
patch to Luis to test.

That one is also disgusting, but for different reasons: it needs some
polish if it works. It's a very simple patch, in that it just extends
our existing i_writecount and ETXTBSY logic to also have a "exclusive"
mode, and says that we do the module file reading in that exclusive
mode (so if/when udev in its incompetence tries to load the same
module X number of times at the same time, only one will read at a
time).

The disgusting part is mainly the hacky test for "id ==
READING_MODULE", and it would probably be better with some kind of
"exclusive flag" field for general use, but right now READING_MODULE
is basically that one user.

Luis having explained _why_ we'd want this (and honestly, it took a
couple of tries), I can only say that udev is horribly broken, and
this most definitely should be fixed in user mode. udev randomly
loading the same module multiple times just because it gets confused
is not ok.

Any udev developer that goes "we can't fix it in user space" should be
ashamed of themselves. Really? Just randomly doing the same thing in
parallel and expecting the kernel to sort out your mess? What a crock.

But this *might* mitigate that udev horror. And not introduce any new
kernel-side horror, just a slight extension of our existing writer
exclusion logic to allow "full exclusive access".

(Note: it's not actually excluding other purely regular readers - but
it *is* excluding not just writers, but also other "special readers"
that want to exclude writers)

I'd like to point out that this patch really is completely untested.
It built for me, but that's all the testing it has gotten. It's
_small_. Tiny, even. But that "id == READING_MODULE" thing really is
pretty disgusting and I feel this needs more thought.

                         Linus
Christian Brauner May 25, 2023, 7:01 a.m. UTC | #5
On Wed, May 24, 2023 at 02:52:50PM -0700, Linus Torvalds wrote:
> On Wed, May 24, 2023 at 2:36 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > Add support for a new call which allows to detect duplicate requests
> > for each inode passed.
> 
> No.
> 
> This is all disgusting.
> 
> Stop adding horrific code for some made-up load that isn't real.

Also, this series adds non-trivial amount of code to fs/ without ever
having Cced fsdevel. As I told the bpf folks already if fs/ code is
touched fsdevel should absolutely be Cced, please.

It's literally also the first thing that get_maintainers.pl spews out.
Luis Chamberlain May 25, 2023, 6:08 p.m. UTC | #6
+ fsdevel please review,

On Wed, May 24, 2023 at 09:00:02PM -0700, Linus Torvalds wrote:
> On Wed, May 24, 2023 at 2:52 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > This is all disgusting.
> 
> Bringing back the original thread, because I just sent an alternate
> patch to Luis to test.
> 
> That one is also disgusting, but for different reasons: it needs some
> polish if it works. It's a very simple patch, in that it just extends
> our existing i_writecount and ETXTBSY logic to also have a "exclusive"
> mode, and says that we do the module file reading in that exclusive
> mode (so if/when udev in its incompetence tries to load the same
> module X number of times at the same time, only one will read at a
> time).

Indeed, this is the sort of gem I was hoping we could acomplish.

> The disgusting part is mainly the hacky test for "id ==
> READING_MODULE", and it would probably be better with some kind of
> "exclusive flag" field for general use, but right now READING_MODULE
> is basically that one user.
> 
> Luis having explained _why_ we'd want this (and honestly, it took a
> couple of tries), I can only say that udev is horribly broken, and
> this most definitely should be fixed in user mode. udev randomly
> loading the same module multiple times just because it gets confused
> is not ok.

At this point it would be good for for someone on the udev camp to at
least to *try*. If the problem is the fork on udev, maybe the shmem
could be used to share the module context to prevent duplicates.

> Any udev developer that goes "we can't fix it in user space" should be
> ashamed of themselves. Really? Just randomly doing the same thing in
> parallel and expecting the kernel to sort out your mess? What a crock.
> 
> But this *might* mitigate that udev horror. And not introduce any new
> kernel-side horror, just a slight extension of our existing writer
> exclusion logic to allow "full exclusive access".

Yes, that expresses what is needed well and is simple enough.

> (Note: it's not actually excluding other purely regular readers - but
> it *is* excluding not just writers, but also other "special readers"
> that want to exclude writers)
> 
> I'd like to point out that this patch really is completely untested.
> It built for me, but that's all the testing it has gotten. It's
> _small_. Tiny, even. But that "id == READING_MODULE" thing really is
> pretty disgusting and I feel this needs more thought.

>  fs/kernel_read_file.c | 6 +++++-
>  include/linux/fs.h    | 6 ++++++
>  2 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c
> index 5d826274570c..ff3e894f8cd4 100644
> --- a/fs/kernel_read_file.c
> +++ b/fs/kernel_read_file.c
> @@ -48,7 +48,11 @@ ssize_t kernel_read_file(struct file *file, loff_t offset, void **buf,
>  	if (!S_ISREG(file_inode(file)->i_mode))
>  		return -EINVAL;
>  
> -	ret = deny_write_access(file);
> +	/* Module reading wants *exclusive* access to the file */
> +	if (id == READING_MODULE)
> +		ret = exclusive_deny_write_access(file);
> +	else
> +		ret = deny_write_access(file);
>  	if (ret)
>  		return ret;
>  
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 21a981680856..722b42a77d51 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2566,6 +2566,12 @@ static inline int deny_write_access(struct file *file)
>  	struct inode *inode = file_inode(file);
>  	return atomic_dec_unless_positive(&inode->i_writecount) ? 0 : -ETXTBSY;
>  }
> +static inline int exclusive_deny_write_access(struct file *file)
> +{
> +	int old = 0;
> +	struct inode *inode = file_inode(file);
> +	return atomic_try_cmpxchg(&inode->i_writecount, &old, -1) ? 0 : -ETXTBSY;
> +}
>  static inline void put_write_access(struct inode * inode)
>  {
>  	atomic_dec(&inode->i_writecount);

Certainly on the track where I wish we could go. Now this goes tested.
On 255 cores:

Before:                                                                          
                                                                                 
vagrant@kmod ~ $ sudo systemd-analyze                                            
Startup finished in 41.653s (kernel) + 44.305s (userspace) = 1min 25.958s        
graphical.target reached after 44.178s in userspace.  

root@kmod ~ # grep "Virtual mem wasted bytes" /sys/kernel/debug/modules/stats    
 Virtual mem wasted bytes       1949006968 

                                                                                 
; 1949006968/1024/1024/1024                                                      
        ~1.81515418738126754761                                                  
                                                                                 
So ~1.8 GiB... of vmalloc space wasted during boot.

After:

systemd-analyze 
Startup finished in 24.438s (kernel) + 41.278s (userspace) = 1min 5.717s 
graphical.target reached after 41.154s in userspace.

root@kmod ~ # grep "Virtual mem wasted bytes" /sys/kernel/debug/modules/stats
 Virtual mem wasted bytes       354413398

So still 337.99 MiB of vmalloc space wasted during boot due to
duplicates. The reason is the exclusive_deny_write_access() must be
kept during the life of the module otherwise as soon as it is done
others can still race to load after and fail later after it sees the
module is already loaded. It sounds crazy to think that such races
exist because that means userspace didn't see the module loaded yet
and still tried finit_module() but the stats reveal that's the
case.

So with two other hunks added (2nd and 4th), this now matches parity with
my patch, not suggesting this is right, just demonstrating how this
could be resolved with this. We could also just have a helper which lets
the module code allow_write_access() at the end of its use of the fd
(failure to load or module is removed).

diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c
index 5d826274570c..ff5b338a288b 100644
--- a/fs/kernel_read_file.c
+++ b/fs/kernel_read_file.c
@@ -48,7 +48,11 @@ ssize_t kernel_read_file(struct file *file, loff_t offset, void **buf,
 	if (!S_ISREG(file_inode(file)->i_mode))
 		return -EINVAL;
 
-	ret = deny_write_access(file);
+	/* Module reading wants *exclusive* access to the file */
+	if (id == READING_MODULE)
+		ret = exclusive_deny_write_access(file);
+	else
+		ret = deny_write_access(file);
 	if (ret)
 		return ret;
 
@@ -119,7 +123,8 @@ ssize_t kernel_read_file(struct file *file, loff_t offset, void **buf,
 	}
 
 out:
-	allow_write_access(file);
+	if (id != READING_MODULE)
+		allow_write_access(file);
 	return ret == 0 ? copied : ret;
 }
 EXPORT_SYMBOL_GPL(kernel_read_file);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 21a981680856..722b42a77d51 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2566,6 +2566,12 @@ static inline int deny_write_access(struct file *file)
 	struct inode *inode = file_inode(file);
 	return atomic_dec_unless_positive(&inode->i_writecount) ? 0 : -ETXTBSY;
 }
+static inline int exclusive_deny_write_access(struct file *file)
+{
+	int old = 0;
+	struct inode *inode = file_inode(file);
+	return atomic_try_cmpxchg(&inode->i_writecount, &old, -1) ? 0 : -ETXTBSY;
+}
 static inline void put_write_access(struct inode * inode)
 {
 	atomic_dec(&inode->i_writecount);
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 044aa2c9e3cb..88aaada929b1 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -3078,8 +3079,10 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
 	len = kernel_read_file_from_fd(fd, 0, &buf, INT_MAX, NULL,
 				       READING_MODULE);
 	if (len < 0) {
-		mod_stat_inc(&failed_kreads);
-		mod_stat_add_long(len, &invalid_kread_bytes);
+		if (len != -ETXTBSY) {
+			mod_stat_inc(&failed_kreads);
+			mod_stat_add_long(len, &invalid_kread_bytes);
+		}
 		return len;
 	}
Luis Chamberlain May 25, 2023, 6:35 p.m. UTC | #7
On Thu, May 25, 2023 at 11:08:13AM -0700, Luis Chamberlain wrote:
> + fsdevel please review,

> So with two other hunks added (2nd and 4th), this now matches parity with
> my patch, not suggesting this is right, just demonstrating how this
> could be resolved with this. We could also just have a helper which lets
> the module code allow_write_access() at the end of its use of the fd
> (failure to load or module is removed).

This even fixes the pathological case with stress-ng for finit_module:

./stress-ng --module 8192 --module-name xfs

(stress-ng assumes you have all dependencies already loaded and
 the module is not loaded, it uses finit_module() directly)

  Luis
Linus Torvalds May 25, 2023, 6:50 p.m. UTC | #8
On Thu, May 25, 2023 at 11:08 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> Certainly on the track where I wish we could go. Now this goes tested.
> On 255 cores:
>
> Before:
>
> vagrant@kmod ~ $ sudo systemd-analyze
> Startup finished in 41.653s (kernel) + 44.305s (userspace) = 1min 25.958s
> graphical.target reached after 44.178s in userspace.
>
> root@kmod ~ # grep "Virtual mem wasted bytes" /sys/kernel/debug/modules/stats
>  Virtual mem wasted bytes       1949006968
>
>
> ; 1949006968/1024/1024/1024
>         ~1.81515418738126754761
>
> So ~1.8 GiB... of vmalloc space wasted during boot.
>
> After:
>
> systemd-analyze
> Startup finished in 24.438s (kernel) + 41.278s (userspace) = 1min 5.717s
> graphical.target reached after 41.154s in userspace.
>
> root@kmod ~ # grep "Virtual mem wasted bytes" /sys/kernel/debug/modules/stats
>  Virtual mem wasted bytes       354413398
>
> So still 337.99 MiB of vmalloc space wasted during boot due to
> duplicates.

Ok. I think this will count as 'good enough for mitigation purposes'

> The reason is the exclusive_deny_write_access() must be
> kept during the life of the module otherwise as soon as it is done
> others can still race to load

Yes. The exclusion only applies while the file is actively being read.

> So with two other hunks added (2nd and 4th), this now matches parity with
> my patch, not suggesting this is right,

Yeah, we can't do that, because user space may quite validly want to
write the file afterwards.

Or, in fact, unload the module and re-load it.

So the "exclusion" really needs to be purely temporary.

That said, I considered moving the exclusion to module/main.c itself,
rather than the reading part. That wouild get rid of the hacky "id ==
READING_MODULE", and put the exclusion in the place that actually
wants it.

And that would allow us to at least extend that temporary exlusion a
bit - we could keep it until the module has actually been loaded and
inited.

So it would probably improve on those numbers a bit more, but you'd
still have the fundamental race where *serial* duplicates end up
always wasting CPU effort and temporary vmalloc space.

                   Linus
Luis Chamberlain May 25, 2023, 7:32 p.m. UTC | #9
On Thu, May 25, 2023 at 11:50 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> So it would probably improve on those numbers a bit more, but you'd
> still have the fundamental race where *serial* duplicates end up
> always wasting CPU effort and temporary vmalloc space.

The known failed boots are with KASAN with a large number of CPUs, so
the value in
the mitigation would be to help those boot until userspace fixes it
and we have enough
time for propagation. But since it is not a full proof solution, it
may seem like an odd thing
to have in place later and this being lost as odd tribal knowledge.
I'd be in favor of only
applying the mitigation if we really are chasing userspace to fix
this, and we'd be OK
in later removing it after userspace gets this fixed / propagated.

If we're going to have userspace fix this, who is volunteering?

  Luis
diff mbox series

Patch

diff --git a/fs/Kconfig b/fs/Kconfig
index cc07a0cd3172..0a78657b00d5 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -18,6 +18,9 @@  config VALIDATE_FS_PARSER
 config FS_IOMAP
 	bool
 
+config KREAD_UNIQ
+	bool
+
 # old blockdev_direct_IO implementation.  Use iomap for new code instead
 config LEGACY_DIRECT_IO
 	bool
diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c
index 5d826274570c..a066e2f239e8 100644
--- a/fs/kernel_read_file.c
+++ b/fs/kernel_read_file.c
@@ -1,9 +1,12 @@ 
 // SPDX-License-Identifier: GPL-2.0-only
+#define pr_fmt(fmt) "kread: " fmt
+
 #include <linux/fs.h>
 #include <linux/fs_struct.h>
 #include <linux/kernel_read_file.h>
 #include <linux/security.h>
 #include <linux/vmalloc.h>
+#include <linux/fdtable.h>
 
 /**
  * kernel_read_file() - read file contents into a kernel buffer
@@ -187,3 +190,124 @@  ssize_t kernel_read_file_from_fd(int fd, loff_t offset, void **buf,
 	return ret;
 }
 EXPORT_SYMBOL_GPL(kernel_read_file_from_fd);
+
+#ifdef CONFIG_KREAD_UNIQ
+static DEFINE_MUTEX(kread_dup_mutex);
+static LIST_HEAD(kread_dup_reqs);
+
+struct kread_dup_req {
+	struct list_head list;
+	unsigned long i_ino;
+};
+
+static struct kread_dup_req *kread_dup_request_lookup(unsigned long i_ino)
+{
+	struct kread_dup_req *kread_req;
+
+	list_for_each_entry_rcu(kread_req, &kread_dup_reqs, list,
+				lockdep_is_held(&kread_dup_mutex)) {
+		if (kread_req->i_ino == i_ino)
+			return kread_req;
+        }
+
+	return NULL;
+}
+
+static struct kread_dup_req *kread_dup_request_new(char *name, unsigned long i_ino)
+{
+	struct kread_dup_req *kread_req, *new_kread_req;
+
+	/*
+	 * Pre-allocate the entry in case we have to use it later
+	 * to avoid contention with the mutex.
+	 */
+	new_kread_req = kzalloc(sizeof(*new_kread_req), GFP_KERNEL);
+	if (!new_kread_req)
+		return false;
+
+	new_kread_req->i_ino = i_ino;
+
+	kread_req = kread_dup_request_lookup(i_ino);
+	if (!kread_req) {
+		/*
+		 * There was no duplicate, just add the request so we can
+		 * keep tab on duplicates later.
+		 */
+		pr_debug("accepted request for i_ino %lu for: %s\n", i_ino, name);
+		return new_kread_req;
+	}
+
+	/* We are dealing with a duplicate request now */
+
+	kfree(new_kread_req);
+
+	pr_debug("duplicate request on i_ino %lu for: %s\n", i_ino, name);
+
+	return NULL;
+}
+
+ssize_t kread_uniq_fd(int fd, loff_t offset, void **buf, unsigned long *i_ino,
+		      size_t buf_size, size_t *file_size, enum kernel_read_file_id id)
+{
+	struct fd f = fdget(fd);
+	ssize_t ret = -EBADF;
+	char *name, *path;
+	struct kread_dup_req *req;
+
+	if (!f.file || !(f.file->f_mode & FMODE_READ))
+		goto out;
+
+	path = kzalloc(PATH_MAX, GFP_KERNEL);
+	if (!path)
+		return -ENOMEM;
+
+	name = file_path(f.file, path, PATH_MAX);
+	if (IS_ERR(name)) {
+		ret = PTR_ERR(name);
+		goto out_mem;
+	}
+
+	*i_ino = file_inode(f.file)->i_ino;
+
+	mutex_lock(&kread_dup_mutex);
+	req = kread_dup_request_new(name, *i_ino);
+	if (!req) {
+		mutex_unlock(&kread_dup_mutex);
+		ret = -EBUSY;
+		goto out_mem;
+	}
+
+	ret = kernel_read_file(f.file, offset, buf, buf_size, file_size, id);
+	if (ret < 0)
+		kfree(req);
+	else
+		list_add_rcu(&req->list, &kread_dup_reqs);
+	mutex_unlock(&kread_dup_mutex);
+out_mem:
+	kfree(path);
+out:
+	fdput(f);
+	return ret;
+}
+
+void kread_uniq_fd_free(unsigned long i_ino)
+{
+	struct kread_dup_req *kread_req;
+
+	if (!i_ino)
+		return;
+
+	mutex_lock(&kread_dup_mutex);
+
+	kread_req = kread_dup_request_lookup(i_ino);
+	if (!kread_req)
+		goto out;
+
+	list_del_rcu(&kread_req->list);
+	synchronize_rcu();
+
+out:
+	mutex_unlock(&kread_dup_mutex);
+	kfree(kread_req);
+}
+#endif /* CONFIG_KREAD_UNIQ */
diff --git a/include/linux/kernel_read_file.h b/include/linux/kernel_read_file.h
index 90451e2e12bd..884985b0dc88 100644
--- a/include/linux/kernel_read_file.h
+++ b/include/linux/kernel_read_file.h
@@ -51,5 +51,19 @@  ssize_t kernel_read_file_from_fd(int fd, loff_t offset,
 				 void **buf, size_t buf_size,
 				 size_t *file_size,
 				 enum kernel_read_file_id id);
+#ifdef CONFIG_KREAD_UNIQ
+ssize_t kread_uniq_fd(int fd, loff_t offset, void **buf, unsigned long *i_ino,
+		      size_t buf_size, size_t *file_size, enum kernel_read_file_id id);
+void kread_uniq_fd_free(unsigned long i_ino);
+#else
+static inline ssize_t kread_uniq_fd(int fd, loff_t offset, void **buf, unsigned long *i_ino,
+				    size_t buf_size, size_t *file_size, enum kernel_read_file_id id)
+{
+	return kernel_read_file_from_fd(fd, offset, buf, buf_size, file_size, id);
+}
+static inline void kread_uniq_fd_free(unsigned long i_ino)
+{
+}
+#endif /* CONFIG_KREAD_UNIQ */
 
 #endif /* _LINUX_KERNEL_READ_FILE_H */