diff mbox series

[v3] modules: wait do_free_init correctly

Message ID 20240217081810.4155871-1-changbin.du@huawei.com (mailing list archive)
State New, archived
Headers show
Series [v3] modules: wait do_free_init correctly | expand

Commit Message

Changbin Du Feb. 17, 2024, 8:18 a.m. UTC
The synchronization here is just to ensure the module init's been freed
before doing W+X checking. But the commit 1a7b7d922081 ("modules: Use
vmalloc special flag") moves do_free_init() into a global workqueue
instead of call_rcu(). So now rcu_barrier() can not ensure that do_free_init
has completed. We should wait it via flush_work().

Without this fix, we still could encounter false positive reports in
W+X checking, and the rcu synchronization is unnecessary which can
introduce significant delay.

Eric Chanudet reports that the rcu_barrier introduces ~0.1s delay on a
PREEMPT_RT kernel.
  [    0.291444] Freeing unused kernel memory: 5568K
  [    0.402442] Run /sbin/init as init process

With this fix, the above delay can be eliminated.

Fixes: 1a7b7d922081 ("modules: Use vmalloc special flag")
Signed-off-by: Changbin Du <changbin.du@huawei.com>
Cc: Xiaoyi Su <suxiaoyi@huawei.com>
Cc: Eric Chanudet <echanude@redhat.com>

---
v3:
  - amend comment in do_init_module() and update commit msg.
v2:
  - fix compilation issue for no CONFIG_MODULES found by 0-DAY.
---
 include/linux/moduleloader.h | 8 ++++++++
 init/main.c                  | 5 +++--
 kernel/module/main.c         | 9 +++++++--
 3 files changed, 18 insertions(+), 4 deletions(-)

Comments

Andrew Morton Feb. 18, 2024, 9:21 p.m. UTC | #1
On Sat, 17 Feb 2024 16:18:10 +0800 Changbin Du <changbin.du@huawei.com> wrote:

> The synchronization here is just to ensure the module init's been freed
> before doing W+X checking. But the commit 1a7b7d922081 ("modules: Use
> vmalloc special flag") moves do_free_init() into a global workqueue
> instead of call_rcu(). So now rcu_barrier() can not ensure that do_free_init
> has completed. We should wait it via flush_work().
> 
> Without this fix, we still could encounter false positive reports in
> W+X checking, and the rcu synchronization is unnecessary which can
> introduce significant delay.
> 
> Eric Chanudet reports that the rcu_barrier introduces ~0.1s delay on a
> PREEMPT_RT kernel.
>   [    0.291444] Freeing unused kernel memory: 5568K
>   [    0.402442] Run /sbin/init as init process
> 
> With this fix, the above delay can be eliminated.

Thanks, I'll queue this as a delta, to be folded into the base patch
prior to upstreaming.

I added a Tested-by: Eric, if that's OK by him?
Eric Chanudet Feb. 21, 2024, 4:09 a.m. UTC | #2
On Sun, Feb 18, 2024 at 01:21:53PM -0800, Andrew Morton wrote:
> On Sat, 17 Feb 2024 16:18:10 +0800 Changbin Du <changbin.du@huawei.com> wrote:
> > The synchronization here is just to ensure the module init's been freed
> > before doing W+X checking. But the commit 1a7b7d922081 ("modules: Use
> > vmalloc special flag") moves do_free_init() into a global workqueue
> > instead of call_rcu(). So now rcu_barrier() can not ensure that do_free_init
> > has completed. We should wait it via flush_work().
> > 
> > Without this fix, we still could encounter false positive reports in
> > W+X checking, and the rcu synchronization is unnecessary which can
> > introduce significant delay.
> > 
> > Eric Chanudet reports that the rcu_barrier introduces ~0.1s delay on a
> > PREEMPT_RT kernel.
> >   [    0.291444] Freeing unused kernel memory: 5568K
> >   [    0.402442] Run /sbin/init as init process
> > 
> > With this fix, the above delay can be eliminated.
> 
> Thanks, I'll queue this as a delta, to be folded into the base patch
> prior to upstreaming.
> 
> I added a Tested-by: Eric, if that's OK by him?

Absolutely, I should have put it in my initial reply.
Adding here as confirmation:
Tested-by: Eric Chanudet <echanude@redhat.com>

Thanks,
Luis Chamberlain Feb. 21, 2024, 5:40 p.m. UTC | #3
+ live-patching folks,

Finally, things are starting to be much clearer. Thanks for the time
for working on this, some more comments below and a question which
I think deserves some attention.

On Sat, Feb 17, 2024 at 04:18:10PM +0800, Changbin Du wrote:
> The synchronization here is just to ensure the module init's been freed
> before doing W+X checking. 

Some nits, this should read instead:

Fix the ordering of freeing of a module init so that it happens before
W+X checking.

> But the commit 1a7b7d922081 ("modules: Use
> vmalloc special flag") moves do_free_init() into a global workqueue
> instead of call_rcu(). So now rcu_barrier() can not ensure that do_free_init
> has completed. We should wait it via flush_work().

Remove "But" and adjust as:

Commit 1a7b7d922081 ("modules: Use vmalloc special flag") moved
calling do_free_init() into a global workqueue instead of relying on it
being called through call_rcu(..., do_free_init), which used to allowed us
call do_free_init() asynchronously after the end of a subsequent grace             
period. The move to a global workqueue broke the gaurantees for code
which needed to be sure the do_free_init() would complete with rcu_barrier().
To fix this callers which used to rely on rcu_barrier() must now instead
use flush_work(&init_free_wq).

> Without this fix, we still could encounter false positive reports in
> W+X checking,

This is good thanks for the clarification.

I think it would be useful for the commit log then to describe also that
it is not that the freeing was not happening, it is just that our sanity
checkers raced against the permission checkers which assume init memory
is already gone.

> and the rcu synchronization is unnecessary which can
> introduce significant delay.

While this can be true, I am not sure if we can remove it. See below.

> Eric Chanudet reports that the rcu_barrier introduces ~0.1s delay on a
> PREEMPT_RT kernel.

That's a separate issue.

>   [    0.291444] Freeing unused kernel memory: 5568K
>   [    0.402442] Run /sbin/init as init process
> 
> With this fix, the above delay can be eliminated.
> 
> Fixes: 1a7b7d922081 ("modules: Use vmalloc special flag")
> Signed-off-by: Changbin Du <changbin.du@huawei.com>
> Cc: Xiaoyi Su <suxiaoyi@huawei.com>
> Cc: Eric Chanudet <echanude@redhat.com>
> 
> ---
> v3:
>   - amend comment in do_init_module() and update commit msg.
> v2:
>   - fix compilation issue for no CONFIG_MODULES found by 0-DAY.
> ---
>  include/linux/moduleloader.h | 8 ++++++++
>  init/main.c                  | 5 +++--
>  kernel/module/main.c         | 9 +++++++--
>  3 files changed, 18 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
> index 001b2ce83832..89b1e0ed9811 100644
> --- a/include/linux/moduleloader.h
> +++ b/include/linux/moduleloader.h
> @@ -115,6 +115,14 @@ int module_finalize(const Elf_Ehdr *hdr,
>  		    const Elf_Shdr *sechdrs,
>  		    struct module *mod);
>  
> +#ifdef CONFIG_MODULES
> +void flush_module_init_free_work(void);
> +#else
> +static inline void flush_module_init_free_work(void)
> +{
> +}
> +#endif
> +
>  /* Any cleanup needed when module leaves. */
>  void module_arch_cleanup(struct module *mod);
>  
> diff --git a/init/main.c b/init/main.c
> index e24b0780fdff..f0b7e21ac67f 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -99,6 +99,7 @@
>  #include <linux/init_syscalls.h>
>  #include <linux/stackdepot.h>
>  #include <linux/randomize_kstack.h>
> +#include <linux/moduleloader.h>
>  #include <net/net_namespace.h>
>  
>  #include <asm/io.h>
> @@ -1402,11 +1403,11 @@ static void mark_readonly(void)
>  	if (rodata_enabled) {
>  		/*
>  		 * load_module() results in W+X mappings, which are cleaned
> -		 * up with call_rcu().  Let's make sure that queued work is
> +		 * up with init_free_wq. Let's make sure that queued work is
>  		 * flushed so that we don't hit false positives looking for
>  		 * insecure pages which are W+X.
>  		 */
> -		rcu_barrier();

Was this the only source of waiters that used rcu_barrier() to sync ?
What about kallsyms, live-patching ?

This original source to the addition of this rcu_barrier() (in a slight
older modified form with with rcu_barrier_sched()) was commit
ae646f0b9ca13 ("init: fix false positives in W+X checking") since
v4.17 in 2018, 6 years ago. So I'm hoping we don't have any other
side-by new users which have grown dependent on this rcu_barrier() for
other call_rcu()'s they may have used, but it is hard to tell.

So while I agree that flush work is the right solution, removing the
rcu_barrier() is technically another change which could potentially
regress for other reasons now. It is perhaps safe, but I'm used to
surprises for minor changes like these. So I think it makes sense to
lift it now, and test it in the wild to see what could possibly break,
I'd much prefer to split this as two separate commits. One which does
the fix, and another that lifts the rcu_barrier() with the stated
rationale and savings on time of ~0.1s on PREEMPT_RT kernels.

  Luis
Changbin Du Feb. 22, 2024, 9:21 a.m. UTC | #4
On Wed, Feb 21, 2024 at 09:40:48AM -0800, Luis Chamberlain wrote:
> + live-patching folks,
> 
> Finally, things are starting to be much clearer. Thanks for the time
> for working on this, some more comments below and a question which
> I think deserves some attention.
> 
> On Sat, Feb 17, 2024 at 04:18:10PM +0800, Changbin Du wrote:
> > The synchronization here is just to ensure the module init's been freed
> > before doing W+X checking. 
> 
> Some nits, this should read instead:
> 
> Fix the ordering of freeing of a module init so that it happens before
> W+X checking.
> 
> > But the commit 1a7b7d922081 ("modules: Use
> > vmalloc special flag") moves do_free_init() into a global workqueue
> > instead of call_rcu(). So now rcu_barrier() can not ensure that do_free_init
> > has completed. We should wait it via flush_work().
> 
> Remove "But" and adjust as:
> 
> Commit 1a7b7d922081 ("modules: Use vmalloc special flag") moved
> calling do_free_init() into a global workqueue instead of relying on it
> being called through call_rcu(..., do_free_init), which used to allowed us
> call do_free_init() asynchronously after the end of a subsequent grace             
> period. The move to a global workqueue broke the gaurantees for code
> which needed to be sure the do_free_init() would complete with rcu_barrier().
> To fix this callers which used to rely on rcu_barrier() must now instead
> use flush_work(&init_free_wq).
>
Sure, thanks!

> > Without this fix, we still could encounter false positive reports in
> > W+X checking,
> 
> This is good thanks for the clarification.
> 
> I think it would be useful for the commit log then to describe also that
> it is not that the freeing was not happening, it is just that our sanity
> checkers raced against the permission checkers which assume init memory
> is already gone.
> 
okay, I'll apend this detailed explanation.

> > and the rcu synchronization is unnecessary which can
> > introduce significant delay.
> 
> While this can be true, I am not sure if we can remove it. See below.
> 
> > Eric Chanudet reports that the rcu_barrier introduces ~0.1s delay on a
> > PREEMPT_RT kernel.
> 
> That's a separate issue.
> 
> >   [    0.291444] Freeing unused kernel memory: 5568K
> >   [    0.402442] Run /sbin/init as init process
> > 
> > With this fix, the above delay can be eliminated.
> > 
> > Fixes: 1a7b7d922081 ("modules: Use vmalloc special flag")
> > Signed-off-by: Changbin Du <changbin.du@huawei.com>
> > Cc: Xiaoyi Su <suxiaoyi@huawei.com>
> > Cc: Eric Chanudet <echanude@redhat.com>
> > 
> > ---
> > v3:
> >   - amend comment in do_init_module() and update commit msg.
> > v2:
> >   - fix compilation issue for no CONFIG_MODULES found by 0-DAY.
> > ---
> >  include/linux/moduleloader.h | 8 ++++++++
> >  init/main.c                  | 5 +++--
> >  kernel/module/main.c         | 9 +++++++--
> >  3 files changed, 18 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
> > index 001b2ce83832..89b1e0ed9811 100644
> > --- a/include/linux/moduleloader.h
> > +++ b/include/linux/moduleloader.h
> > @@ -115,6 +115,14 @@ int module_finalize(const Elf_Ehdr *hdr,
> >  		    const Elf_Shdr *sechdrs,
> >  		    struct module *mod);
> >  
> > +#ifdef CONFIG_MODULES
> > +void flush_module_init_free_work(void);
> > +#else
> > +static inline void flush_module_init_free_work(void)
> > +{
> > +}
> > +#endif
> > +
> >  /* Any cleanup needed when module leaves. */
> >  void module_arch_cleanup(struct module *mod);
> >  
> > diff --git a/init/main.c b/init/main.c
> > index e24b0780fdff..f0b7e21ac67f 100644
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -99,6 +99,7 @@
> >  #include <linux/init_syscalls.h>
> >  #include <linux/stackdepot.h>
> >  #include <linux/randomize_kstack.h>
> > +#include <linux/moduleloader.h>
> >  #include <net/net_namespace.h>
> >  
> >  #include <asm/io.h>
> > @@ -1402,11 +1403,11 @@ static void mark_readonly(void)
> >  	if (rodata_enabled) {
> >  		/*
> >  		 * load_module() results in W+X mappings, which are cleaned
> > -		 * up with call_rcu().  Let's make sure that queued work is
> > +		 * up with init_free_wq. Let's make sure that queued work is
> >  		 * flushed so that we don't hit false positives looking for
> >  		 * insecure pages which are W+X.
> >  		 */
> > -		rcu_barrier();
> 
> Was this the only source of waiters that used rcu_barrier() to sync ?
> What about kallsyms, live-patching ?
> 
> This original source to the addition of this rcu_barrier() (in a slight
> older modified form with with rcu_barrier_sched()) was commit
> ae646f0b9ca13 ("init: fix false positives in W+X checking") since
> v4.17 in 2018, 6 years ago. So I'm hoping we don't have any other
> side-by new users which have grown dependent on this rcu_barrier() for
> other call_rcu()'s they may have used, but it is hard to tell.
> 
Per the condtion 'rodata_enabled' and comments, I think the rcu_barrier() is
only used to synchronize with freeing module init memory.

> So while I agree that flush work is the right solution, removing the
> rcu_barrier() is technically another change which could potentially
> regress for other reasons now. It is perhaps safe, but I'm used to
> surprises for minor changes like these. So I think it makes sense to
> lift it now, and test it in the wild to see what could possibly break,
> I'd much prefer to split this as two separate commits. One which does
> the fix, and another that lifts the rcu_barrier() with the stated
> rationale and savings on time of ~0.1s on PREEMPT_RT kernels.
>
But the only change in patch is to replace rcu_barrier() with flush_module_init_free_work().

Do you mean that keep both flush_module_init_free_work() and rcu_barrier() here?
It sounds a little bit weird IMHO.

>   Luis
diff mbox series

Patch

diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
index 001b2ce83832..89b1e0ed9811 100644
--- a/include/linux/moduleloader.h
+++ b/include/linux/moduleloader.h
@@ -115,6 +115,14 @@  int module_finalize(const Elf_Ehdr *hdr,
 		    const Elf_Shdr *sechdrs,
 		    struct module *mod);
 
+#ifdef CONFIG_MODULES
+void flush_module_init_free_work(void);
+#else
+static inline void flush_module_init_free_work(void)
+{
+}
+#endif
+
 /* Any cleanup needed when module leaves. */
 void module_arch_cleanup(struct module *mod);
 
diff --git a/init/main.c b/init/main.c
index e24b0780fdff..f0b7e21ac67f 100644
--- a/init/main.c
+++ b/init/main.c
@@ -99,6 +99,7 @@ 
 #include <linux/init_syscalls.h>
 #include <linux/stackdepot.h>
 #include <linux/randomize_kstack.h>
+#include <linux/moduleloader.h>
 #include <net/net_namespace.h>
 
 #include <asm/io.h>
@@ -1402,11 +1403,11 @@  static void mark_readonly(void)
 	if (rodata_enabled) {
 		/*
 		 * load_module() results in W+X mappings, which are cleaned
-		 * up with call_rcu().  Let's make sure that queued work is
+		 * up with init_free_wq. Let's make sure that queued work is
 		 * flushed so that we don't hit false positives looking for
 		 * insecure pages which are W+X.
 		 */
-		rcu_barrier();
+		flush_module_init_free_work();
 		mark_rodata_ro();
 		rodata_test();
 	} else
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 36681911c05a..b0b99348e1a8 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2489,6 +2489,11 @@  static void do_free_init(struct work_struct *w)
 	}
 }
 
+void flush_module_init_free_work(void)
+{
+	flush_work(&init_free_wq);
+}
+
 #undef MODULE_PARAM_PREFIX
 #define MODULE_PARAM_PREFIX "module."
 /* Default value for module->async_probe_requested */
@@ -2593,8 +2598,8 @@  static noinline int do_init_module(struct module *mod)
 	 * Note that module_alloc() on most architectures creates W+X page
 	 * mappings which won't be cleaned up until do_free_init() runs.  Any
 	 * code such as mark_rodata_ro() which depends on those mappings to
-	 * be cleaned up needs to sync with the queued work - ie
-	 * rcu_barrier()
+	 * be cleaned up needs to sync with the queued work by invoking
+	 * flush_module_init_free_work().
 	 */
 	if (llist_add(&freeinit->node, &init_free_list))
 		schedule_work(&init_free_wq);