Message ID: 20230414052840.1994456-3-mcgrof@kernel.org (mailing list archive)
State: New
Series: module: fix virtual memory wasted on finit_module()
On Thu, Apr 13, 2023 at 10:28:40PM -0700, Luis Chamberlain wrote:
> With this we run into 0 wasted virtual memory bytes.

This changelog does not make any sense at all, sorry. What are you
doing here and why?

> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  fs/kernel_read_file.c | 150 ++++++++++++++++++++++++++++++++++++++++++
>  kernel/module/main.c  |   6 +-
>  2 files changed, 154 insertions(+), 2 deletions(-)
>
> diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c
> index 5d826274570c..209c56764442 100644
> --- a/fs/kernel_read_file.c
> +++ b/fs/kernel_read_file.c
> @@ -4,6 +4,7 @@
>  #include <linux/kernel_read_file.h>
>  #include <linux/security.h>
>  #include <linux/vmalloc.h>
> +#include <linux/fdtable.h>
>
>  /**
>   * kernel_read_file() - read file contents into a kernel buffer
> @@ -171,17 +172,166 @@ ssize_t kernel_read_file_from_path_initns(const char *path, loff_t offset,
>  }
>  EXPORT_SYMBOL_GPL(kernel_read_file_from_path_initns);
>
> +DEFINE_MUTEX(kread_dup_mutex);

static?

I stopped reading here :)

thanks,

greg k-h
On Fri, Apr 14, 2023 at 08:35:53AM +0200, Greg KH wrote:
> On Thu, Apr 13, 2023 at 10:28:40PM -0700, Luis Chamberlain wrote:
> > With this we run into 0 wasted virtual memory bytes.
>
> This changelog does not make any sense at all, sorry. What are you
> doing here and why?

It's an RFC; the cover letter described the motivation and asked for
ideas for an alternative, which is why I was so terse in the commit
log.

  Luis
On Thu, Apr 13, 2023 at 10:28:40PM -0700, Luis Chamberlain wrote:
> With this we run into 0 wasted virtual memory bytes.
Avoid what duplicates?
On Sat, Apr 15, 2023 at 11:04:12PM -0700, Christoph Hellwig wrote:
> On Thu, Apr 13, 2023 at 10:28:40PM -0700, Luis Chamberlain wrote:
> > With this we run into 0 wasted virtual memory bytes.
>
> Avoid what duplicates?

David Hildenbrand had reported that with over 400 CPUs vmap space
runs out and it seems it was related to module loading. I took a
look and confirmed it. Module loading ends up requiring in the
worst case 3 vmalloc allocations, so typically at least twice the
module size, and in the worst case add the decompressed module size
on top:

  a) the initial kernel_read*() call
  b) optional module decompression
  c) the actual module data copy we will keep

Duplicate module requests that come from userspace end up being thrown
in the trash bin, as only one module will be allocated. Although there
are checks for a module prior to requesting it, udev still doesn't do
the best job of avoiding that, and so we end up with tons of duplicate
module requests. We're talking about gigabytes of vmalloc bytes just
lost because of this for large systems and megabytes for average
systems. For example, with just 255 CPUs we can lose about 13.58 GiB,
and with 8 CPUs about 226.53 MiB.

I have patches to curtail half of that space by checking in the
kernel, before we do allocation c), whether the module is already
present. For a) it is harder because userspace just passes a file
descriptor. But since we can get the file path without the vmalloc,
this RFC suggests we can add a new kernel_read*() for module loading
where it makes sense to have only one read happen at a time.

  Luis
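[Editor's note: to make the duplicate-request pattern above concrete, here is a minimal userspace sketch that fires the same finit_module() request from several threads, the way parallel udev workers do. The module path is a placeholder; point it at an uncompressed, currently unloaded .ko with no unmet dependencies and run as root. Each call still costs the kernel a full kernel_read*() into vmalloc space before the duplicate is rejected.]

/* build: gcc -pthread dup_load.c -o dup_load */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define NTHREADS 8

/* Placeholder path -- adjust for your kernel. */
static const char *ko = "/lib/modules/6.3.0/kernel/fs/xfs/xfs.ko";

static void *load(void *unused)
{
	int fd = open(ko, O_RDONLY);

	if (fd < 0)
		return NULL;
	/* Only one load wins; the rest typically fail with EEXIST,
	 * but each one still triggered the kernel-side read. */
	if (syscall(SYS_finit_module, fd, "", 0))
		printf("finit_module: %s\n", strerror(errno));
	else
		printf("finit_module: loaded\n");
	close(fd);
	return NULL;
}

int main(void)
{
	pthread_t t[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, load, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	return 0;
}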
On Sat, Apr 15, 2023 at 11:41:28PM -0700, Luis Chamberlain wrote:
> On Sat, Apr 15, 2023 at 11:04:12PM -0700, Christoph Hellwig wrote:
> > Avoid what duplicates?
>
> David Hildenbrand had reported that with over 400 CPUs vmap space
> runs out and it seems it was related to module loading. I took a
> look and confirmed it. Module loading ends up requiring in the
> worst case 3 vmalloc allocations:
>
>   a) the initial kernel_read*() call
>   b) optional module decompression
>   c) the actual module data copy we will keep
>
> [...] We're talking about gigabytes of vmalloc bytes just lost
> because of this for large systems and megabytes for average systems.
> For example, with just 255 CPUs we can lose about 13.58 GiB, and
> with 8 CPUs about 226.53 MiB.

How does the memory get "lost"? Shouldn't it be properly freed when the
duplicate module load fails?

thanks,

greg k-h
On Sun, Apr 16, 2023 at 02:50:01PM +0200, Greg KH wrote:
> On Sat, Apr 15, 2023 at 11:41:28PM -0700, Luis Chamberlain wrote:
> > [...] We're talking about gigabytes of vmalloc bytes just lost
> > because of this for large systems and megabytes for average
> > systems. For example, with just 255 CPUs we can lose about
> > 13.58 GiB, and with 8 CPUs about 226.53 MiB.
>
> How does the memory get "lost"? Shouldn't it be properly freed when
> the duplicate module load fails?

Yes, the memory gets freed, but since virtual memory space can be
limited it also means you can eventually reach the point where
-ENOMEMs happen as you add more CPUs: you cannot use virtual memory
for other things during kernel bootup, and bootup fails. This is
apparently exacerbated with KASAN enabled.

  Luis
On Sun, Apr 16, 2023 at 11:46:44AM -0700, Luis Chamberlain wrote:
> Yes, the memory gets freed, but since virtual memory space can be
> limited it also means you can eventually reach the point where
> -ENOMEMs happen as you add more CPUs: you cannot use virtual memory
> for other things during kernel bootup, and bootup fails. This is
> apparently exacerbated with KASAN enabled.

Then why not just rate-limit the module loader in userspace on such
large systems if that's an issue? No kernel changes needed to do that.

thanks,

greg k-h
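[Editor's note: one way such userspace serialization could look is a hypothetical wrapper, not an existing tool, that takes a per-module flock() before exec'ing modprobe, so concurrent requests for the same module take turns instead of racing into the kernel. A minimal sketch:]

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char lockpath[256];
	char *nargv[] = { "modprobe", argv[1], NULL };
	int lfd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <module>\n", argv[0]);
		return 1;
	}
	snprintf(lockpath, sizeof(lockpath), "/run/modload-%s.lock", argv[1]);
	lfd = open(lockpath, O_CREAT | O_RDWR, 0600);
	if (lfd < 0)
		return 1;
	/* The lock is inherited across execvp() and released when the
	 * modprobe process exits, so duplicate requests serialize here. */
	flock(lfd, LOCK_EX);
	execvp("modprobe", nargv);
	return 1;
}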
On Sat, 2023-04-15 at 23:41 -0700, Luis Chamberlain wrote:
> Module loading ends up requiring in the worst case 3 vmalloc
> allocations:
>
>   a) the initial kernel_read*() call
>   b) optional module decompression
>   c) the actual module data copy we will keep
> [...]

I'm wondering how difficult it would be to just try to remove the
vmallocs in (a) and (b) and operate on a list of pages. So the
operations before module_patient_check_exists() are now:

 1. decompressing (vmalloc)
 2. signature check (vmalloc)
 3. elf_validity_cache_copy()
 4. early_mod_check() -> module_patient_check_exists()

Right? Then after that a bunch of arch code and other code outside of
modules operates on the vmalloc, so this other code would take a large
amount of changes to switch to a list of pages. But did you consider
teaching just 1-3 to operate on a list of pages, and moving
module_patient_check_exists() a little up in the order? After
module_patient_check_exists() you could vmap() the pages and hand the
result off to the existing code located all over the place.

Then you can catch the duplicate requests before any vmalloc happens.
As a side benefit, it also takes (a) and (b) down to one vmalloc even
in the normal case. The changes to the signature check part might be
tricky though.

Sorry if this idea is off; I've gotten a little confused as this
series split into all these offshoot series.
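[Editor's note: a rough sketch of the shape described above, with hypothetical helper names rather than code from any tree. The image stays a list of discrete pages while the early checks run, and a contiguous view is only vmap()ed after module_patient_check_exists() has ruled out a duplicate, so duplicate requests never touch vmalloc address space.]

#include <linux/mm.h>
#include <linux/vmalloc.h>

/* Hypothetical: back the module image with discrete pages. */
static struct page **kread_alloc_pages(unsigned int nr_pages)
{
	struct page **pages;
	unsigned int i;

	pages = kvmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return NULL;

	for (i = 0; i < nr_pages; i++) {
		pages[i] = alloc_page(GFP_KERNEL);
		if (!pages[i])
			goto err_free;
	}
	return pages;

err_free:
	while (i--)
		__free_page(pages[i]);
	kvfree(pages);
	return NULL;
}

/*
 * Hypothetical: only called once module_patient_check_exists() has
 * ruled out a duplicate. vmap() consumes virtual address space but
 * allocates no new physical pages.
 */
static void *kread_map_pages(struct page **pages, unsigned int nr_pages)
{
	return vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
}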
On Mon, Apr 17, 2023 at 08:05:31AM +0200, Greg KH wrote:
> Then why not just rate-limit the module loader in userspace on such
> large systems if that's an issue? No kernel changes needed to do
> that.

We can certainly just take a stance and punt this as a userspace
problem. I thought it would be good to see what a kernel style of
workaround would look like, for us to evaluate.

  Luis
On Mon, Apr 17, 2023 at 05:33:49PM +0000, Edgecombe, Rick P wrote:
> I'm wondering how difficult it would be to just try to remove the
> vmallocs in (a) and (b) and operate on a list of pages.

Yes, I think it's worth doing that long term, if possible with
sequential reads.

  Luis
On Mon, Apr 17, 2023 at 03:08:34PM -0700, Luis Chamberlain wrote:
> On Mon, Apr 17, 2023 at 05:33:49PM +0000, Edgecombe, Rick P wrote:
> > I'm wondering how difficult it would be to just try to remove the
> > vmallocs in (a) and (b) and operate on a list of pages.
>
> Yes, I think it's worth doing that long term, if possible with
> sequential reads.

OK, here's what I suggest we do then:

I'll resubmit the first patch, which allows us to prove / disprove
whether module auto-loading is the culprit. With that in place, folks
can debug their setup and verify how udev is to blame.

I'll drop the second kernel_read*() patch / effort and punt this as a
userspace problem, as it is also not extremely pressing.

Long term we should evaluate how we can avoid vmalloc for the kread
and module decompression. If this really becomes a pressing issue we
can revisit whether we want an in-kernel solution, but at this point
that would likely mean systems with over 400-500 CPUs with KASAN
enabled. Without KASAN the issue should eventually trigger if you're
enabling modules, but it's hard to say at what point you'd hit it.

  Luis
diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c
index 5d826274570c..209c56764442 100644
--- a/fs/kernel_read_file.c
+++ b/fs/kernel_read_file.c
@@ -4,6 +4,7 @@
 #include <linux/kernel_read_file.h>
 #include <linux/security.h>
 #include <linux/vmalloc.h>
+#include <linux/fdtable.h>
 
 /**
  * kernel_read_file() - read file contents into a kernel buffer
@@ -171,17 +172,166 @@ ssize_t kernel_read_file_from_path_initns(const char *path, loff_t offset,
 }
 EXPORT_SYMBOL_GPL(kernel_read_file_from_path_initns);
 
+DEFINE_MUTEX(kread_dup_mutex);
+static LIST_HEAD(kread_dup_reqs);
+
+struct kread_dup_req {
+	struct list_head list;
+	char name[PATH_MAX];
+	struct completion first_req_done;
+	struct work_struct complete_work;
+	struct delayed_work delete_work;
+	int dup_ret;
+};
+
+static struct kread_dup_req *kread_dup_request_lookup(char *name)
+{
+	struct kread_dup_req *kread_req;
+
+	list_for_each_entry_rcu(kread_req, &kread_dup_reqs, list,
+				lockdep_is_held(&kread_dup_mutex)) {
+		if (strlen(kread_req->name) == strlen(name) &&
+		    !memcmp(kread_req->name, name, strlen(name))) {
+			return kread_req;
+		}
+	}
+
+	return NULL;
+}
+
+static void kread_dup_request_delete(struct work_struct *work)
+{
+	struct kread_dup_req *kread_req;
+	kread_req = container_of(to_delayed_work(work), struct kread_dup_req, delete_work);
+
+	mutex_lock(&kread_dup_mutex);
+	list_del_rcu(&kread_req->list);
+	synchronize_rcu();
+	mutex_unlock(&kread_dup_mutex);
+	kfree(kread_req);
+}
+
+static void kread_dup_request_complete(struct work_struct *work)
+{
+	struct kread_dup_req *kread_req;
+
+	kread_req = container_of(work, struct kread_dup_req, complete_work);
+
+	complete_all(&kread_req->first_req_done);
+	queue_delayed_work(system_wq, &kread_req->delete_work, 60 * HZ);
+}
+
+static bool kread_dup_request_exists_wait(char *name, int *dup_ret)
+{
+	struct kread_dup_req *kread_req, *new_kread_req;
+	int ret;
+
+	/*
+	 * Pre-allocate the entry in case we have to use it later
+	 * to avoid contention with the mutex.
+	 */
+	new_kread_req = kzalloc(sizeof(*new_kread_req), GFP_KERNEL);
+	if (!new_kread_req)
+		return false;
+
+	memcpy(new_kread_req->name, name, strlen(name));
+	INIT_WORK(&new_kread_req->complete_work, kread_dup_request_complete);
+	INIT_DELAYED_WORK(&new_kread_req->delete_work, kread_dup_request_delete);
+	init_completion(&new_kread_req->first_req_done);
+
+	mutex_lock(&kread_dup_mutex);
+
+	kread_req = kread_dup_request_lookup(name);
+	if (!kread_req) {
+		/*
+		 * There was no duplicate, just add the request so we can
+		 * keep tab on duplicates later.
+		 */
+		//pr_info("New kread request for %s\n", name);
+		list_add_rcu(&new_kread_req->list, &kread_dup_reqs);
+		mutex_unlock(&kread_dup_mutex);
+		return false;
+	}
+	mutex_unlock(&kread_dup_mutex);
+
+	/* We are dealing with a duplicate request now */
+
+	kfree(new_kread_req);
+
+	//pr_warn("kread: duplicate request for file %s\n", name);
+
+	ret = wait_for_completion_state(&kread_req->first_req_done,
+					TASK_UNINTERRUPTIBLE | TASK_KILLABLE);
+	if (ret) {
+		*dup_ret = ret;
+		return true;
+	}
+
+	/* breath */
+	schedule_timeout(2*HZ);
+
+	*dup_ret = kread_req->dup_ret;
+
+	return true;
+}
+
+void kread_dup_request_announce(char *name, int ret)
+{
+	struct kread_dup_req *kread_req;
+
+	mutex_lock(&kread_dup_mutex);
+
+	kread_req = kread_dup_request_lookup(name);
+	if (!kread_req)
+		goto out;
+
+	kread_req->dup_ret = ret;
+
+	/*
+	 * If we complete() here we may allow duplicate threads
+	 * to continue before the first one that submitted the
+	 * request. We're in no rush but avoid boot delays caused
+	 * by these threads waiting too long.
+	 */
+	queue_work(system_wq, &kread_req->complete_work);
+
+out:
+	mutex_unlock(&kread_dup_mutex);
+}
+
 ssize_t kernel_read_file_from_fd(int fd, loff_t offset, void **buf,
 				 size_t buf_size, size_t *file_size,
 				 enum kernel_read_file_id id)
 {
 	struct fd f = fdget(fd);
 	ssize_t ret = -EBADF;
+	char *name, *path;
+	int dup_ret;
 
 	if (!f.file || !(f.file->f_mode & FMODE_READ))
 		goto out;
 
+	path = kzalloc(PATH_MAX, GFP_KERNEL);
+	if (!path)
+		return -ENOMEM;
+
+	name = file_path(f.file, path, PATH_MAX);
+	if (IS_ERR(name)) {
+		ret = PTR_ERR(name);
+		goto out_mem;
+	}
+
+	if (kread_dup_request_exists_wait(name, &dup_ret)) {
+		ret = -EBUSY;
+		goto out_mem;
+	}
+
 	ret = kernel_read_file(f.file, offset, buf, buf_size, file_size, id);
+
+	kread_dup_request_announce(name, ret);
+
+out_mem:
+	kfree(path);
 out:
 	fdput(f);
 	return ret;
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 1ed373145278..e99419b4d85c 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -3080,8 +3080,10 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
 	len = kernel_read_file_from_fd(fd, 0, &buf, INT_MAX, NULL,
 				       READING_MODULE);
 	if (len < 0) {
-		mod_stat_inc(&failed_kreads);
-		mod_stat_add_long(len, &invalid_kread_bytes);
+		if (len != -EBUSY) {
+			mod_stat_inc(&failed_kreads);
+			mod_stat_add_long(len, &invalid_kread_bytes);
+		}
 		return len;
 	}
With this we run into 0 wasted virtual memory bytes.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 fs/kernel_read_file.c | 150 ++++++++++++++++++++++++++++++++++++++++++
 kernel/module/main.c  |   6 +-
 2 files changed, 154 insertions(+), 2 deletions(-)
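[Editor's note: the dedup scheme the patch builds around kread_dup_request_exists_wait() / kread_dup_request_announce() boils down to a first-requester-does-the-work pattern. A simplified userspace analogue with hypothetical names, shown as a compilable sketch rather than a full program:]

#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct dup_req {
	struct dup_req *next;
	char name[256];
	bool done;
	int result;
	pthread_cond_t cv;
};

static struct dup_req *reqs;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns true and fills *result if this was a duplicate request. */
static bool dup_request_exists_wait(const char *name, int *result)
{
	struct dup_req *r;

	pthread_mutex_lock(&lock);
	for (r = reqs; r; r = r->next) {
		if (!strcmp(r->name, name)) {
			while (!r->done)	/* duplicate: wait for the first */
				pthread_cond_wait(&r->cv, &lock);
			*result = r->result;
			pthread_mutex_unlock(&lock);
			return true;
		}
	}
	r = calloc(1, sizeof(*r));		/* first request: register it */
	strncpy(r->name, name, sizeof(r->name) - 1);
	pthread_cond_init(&r->cv, NULL);
	r->next = reqs;
	reqs = r;
	pthread_mutex_unlock(&lock);
	return false;
}

/* First requester publishes its result and wakes all duplicates. */
static void dup_request_announce(const char *name, int result)
{
	struct dup_req *r;

	pthread_mutex_lock(&lock);
	for (r = reqs; r; r = r->next) {
		if (!strcmp(r->name, name)) {
			r->result = result;
			r->done = true;
			pthread_cond_broadcast(&r->cv);
			break;
		}
	}
	pthread_mutex_unlock(&lock);
}

[The kernel version replaces the condition variable with a completion, protects the list with RCU plus a mutex, and garbage-collects entries via delayed work after 60 seconds, but the control flow is the same.]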